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^ Abstract 

H 

This paper introduces a class of k-nearest neighbor (fc-NN) estimators called bi- 
partite plug-in (BPI) estimators for estimating integrals of non-linear functions of a 
probability density, such as Shannon entropy and Renyi entropy. The density is as- 
sumed to be smooth, have bounded support, and be uniformly bounded from below 
I on this set. Unlike previous /c-NN estimators of non-linear density functionals, the 

proposed estimator uses data-splitting and boundary correction to achieve lower mean 
^ square error. Specifically, we assume that T i.i.d. samples X, G M'^ from the density 

OO are split into two pieces of cardinality M and respectively, with M samples used for 

computing a k-nearest-neighbor density estimate and the remaining samples used 
for empirical estimation of the integral of the density functional. By studying the 
statistical properties of k-NN balls, explicit rates for the bias and variance of the BPI 
T— I estimator are derived in terms of the sample size, the dimension of the samples and 

the underlying probability distribution. Based on these results, it is possible to specify 
. . optimal choice of tuning parameters M/T, k for maximizing the rate of decrease of the 

. ^ mean square error (MSE) . The resultant optimized BPI estimator converges faster and 

^ achieves lower mean squared error than previous fc-NN entropy estimators. In addition, 

^ a central limit theorem is established for the BPI estimator that allows us to specify 

tight asymptotic confidence intervals. 



1 Introduction 

Non-linear functionals of a multivariate density / of the form J g{f{x),x)f{x)dx arise in 
applications including machine learning, signal processing, mathematical statistics, and stat- 
istical communication theory. Important examples of such functionals include Shannon and 
Renyi entropy. Entropy based applications for image matching, image registration and tex- 
ture classification are developed in ^20] El]- Entropy functional estimation is fundamental 
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to independent component analysis in signal processing [32]. Entropy has also been used 
in Internet anomaly detection and data and image compression applications [23]. Sev- 
eral entropy based nonparametric statistical tests have been developed for testing statistical 
models including uniformity and normality [HI [TU]. Parameter estimation methods based 
on entropy have been developed in [7l[37|. For further applications, see, for example, Leon- 
enko etal [26] . 

In these applications, the functional of interest must be estimated empirically from sample 
realizations of the underlying densities. Several estimators of entropy measures have been 
proposed for general multivariate densities /. These include consistent estimators based on 
entropic graphs [l9l|36], gap estimators [13], nearest neighbor distances [IZl EEl EHl HS] , kernel 
density plug-in estimators [U [TTl |3|, [HI HI [16] , Edgeworth approximations [21] , convex risk 
minimization [33] and orthogonal projections 

The class of density-plug-in estimators considered in this paper are based on fc-nearest neigh- 
bor (fc-NN) distances and, more specifically, bipartite k-nearest neighbor graphs over the 
random sample. The basic construction of the proposed bipartite plug-in (BPI) estimator is 
as follows (see Sec. II. A for a precise definition). Given a total of T data samples we split 
the data into two parts of size N and size M, N + M = T. On the part of size M a fc-NN 
density estimate is constructed. The density functional is then estimated by plugging the 
fc-NN density estimate into the functional and approximating the integral by an empirical 
average over the remaining N samples. This can be thought of as computing the estimator 
over a bipartite graph with the M density estimation nodes connected to the N integral ap- 
proximating nodes. The BPI estimator exploits a close relation between density estimation 
and the geometry of proximity neighborhoods in the data sample. The BPI estimator is de- 
signed to automatically incorporate boundary correction, without requiring prior knowledge 
of the support of the density. Boundary correction compensates for bias due to distorted 
fc-NN neighborhoods that occur for points near the boundary of the density support set. 
Furthermore, this boundary correction is adaptive in that we achieve the same MSE rate of 
convergence that can be attained using an oracle BPI estimator having knowledge of bound- 
ary of the support. Since the rate of convergence relates the number of samples T = N+M to 
the performance of the estimator, convergence rates have great practical utility. A statistical 
analysis of the bias and variance, including rates of convergence, is presented for this class of 
boundary compensated BPI estimators. In addition, results on weak convergence (CLT) of 
BPI estimators are established. These results are applied to optimally select estimator tun- 
ing parameters M/T, k and to derive confidence intervals. For arbitrary smooth functions 
g, we show that by choosing k increasing in T with order 0(T~^/^^+'^^), an optimal MSE 
rate of order 0(T~*/^'^^^^) is attained by the BPI estimator. For certain specific functions g 
including Shannon entropy {g{u) = \og{u)) and Renyi entropy {g{u) = u°'~^), a faster MSE 
rate of order 0(((logT)^/T)^/'^) is achieved by BPI estimators by correcting for bias. 
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1.1 Previous work on /c-NN functional estimation 



The authors of |10l HTJ [26], |29] propose /c-NN estimators for Shannon entropy {g{u) = log(M)) 
and Renyi entropy((7('u) = u°^^). Evans etal [13] consider positive moments of the fc-NN dis- 
tances {g{u) = 'u'^, A; G N). Recently, Baryshnikov etal [2] proposed /c-NN estimators for estim- 
ating /-divergence J (f){fQ{x)/f{x))f{x)dx between an unknown density /, from which sample 
realizations are available, and a known density /q. Because /q is known, the /-divergence 
/ 0(/o(a:)//(x))/(x)(ix is equivalent to a entropy functional f g{f{x),x)dx for a suitable 
choice of g. Wang etal [15] developed a fc-NN based estimator of / g{fi{x)/f2{x),x)f2{x)dx 
when both /i and /2 are unknown. The authors of these works jlOl |T71 [13], |35] sestablish 
that the estimators they propose are asymptotically unbiased and consistent. The authors of 
[29j analyze estimator bias for fc-NN estimation of Shannon and Renyi entropy. For smooth 
functions g{.), Evans etal [12j show that the variance of the sums of these functionals of fc-NN 
distances is bounded by the rate 0{k^/T). Baryshnikov etal [2] improved on the results of 
Evans etal by determining the exact variance up to the leading term {ck/T for some constant 
Ck which is a function of k). Furthermore, Baryshnikov etal show that the entropy estimator 
they propose converges weakly to a normal distribution. However, Baryshnikov etal do not 
analyze the bias of the estimators, nor do they show that the estimators they propose are 
consistent. Us ing the results obtained in this paper, we provide an expression for this bias 
in Section 4.4 and show that the optimal MSE for Baryshnikov's estimators is 0(T^^/'^^+'^)). 



In contrast, the main contribution of this paper is the analysis of a general class of BPI es- 
timators of smooth density functionals. We provide asymptotic bias and variance expressions 
and a central limit theorem. The bipartite nature of the BPI estimator enables us to correct 
for bias due to truncation of fc-NN neighborhoods near the boundary of the support set; a 
correction that does not appear straightforward for previous fc-NN based entropy estimat- 
ors. We show that the BPI estimator is MSE consistent and that the MSE is guaranteed to 
converge to zero as T — > oo and — ?■ oo with a rate that is minimized for a specific choice 
of fc, M and as a function of T. Therefore, the thus optimized BPI estimator can be 
implemented without any tuning parameters. In addition a CLT is established that can be 
used to construct confidence intervals to empirically assess the quality of the BPI estimator. 
Finally, our method of proof is very general and it is likely that it can be extended to kernel 
density plug-in estimators, /-divergence estimation and mutual information estimation. 

Another important distinction between the BPI estimator and the fc-NN estimators of Shan- 
non and Renyi entropy proposed by the authors of [101 [13 [2S] is that these latter estimators 
are consistent for finite fc, while the proposed BPI estimator requires the condition that 
k ^ oo for MSE convergence. By allowing k — )■ oo, the BPI estimators of Shannon and 
Renyi entropy achieve MSE rate of order 0(((logT)^/T)^/'^). This asymptotic rate is faster 
than the 0{T-'^/'^) MSE convergence rate of the previous fc-NN estimators |iCTl [T71 [^ 
that use a fixed value of k. It is shown by simulation that BPI's asymptotic performance 
advantages, predicted by our theory, also hold for small sample regimes. 
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1.2 Organization 



The remainder of the paper is organized as follows. Section |2] formulates the entropy es- 
timation problem and introduces the BPI estimator. The main results concerning the bias, 
variance and asymptotic distribution of these estimators are stated in Section [3] and the con- 
sequences of these results are discussed. The proofs are given in the Appendix. The MSE is 
analyzed in Section 4. We discuss bias correction of the BPI estimator for the case of Shannon 
and Renyi entropy estimation in Section [5j Estimation of Shannon MI is briefly discussed in 
Section 6. We numerically validate our theory by simulation in Section [7} Applications to 
structure discovery and dimension estimation are discussed in Sections 8 and 9 respectively. 



A conclusion is given in Section 10 



Notation 

Bold face type will indicate random variables and random vectors and regular type face 
will be used for non-random quantities. Denote the expectation operator by the symbol 
E and conditional expectation given Z by Ez. Also define the variance operator as V[X] = 
E[(X-E[X])2] and the covariance operator as Cow[X, Y] = E[(X-E[X])(Y-E[Y])]. Denote 
the bias of an estimator by B. 



2 Preliminaries 

We are interested in estimating non-linear functionals G{f) of (i-dimensional multivariate 
densities / with support S, where G{f) has the form 

Gif) = J g{fix),x)fix)dfxix) = E[gifix),x)], 

for some smooth function g{f{x),x). Let 25 denote the boundary of S. Here, /i denotes the 
Lebesgue measure and E denotes statistical expectation w.r.t density /. We assume that 
i.i.d realizations {Xi, . . . , Xat, Xat+i, . . . , Xtv+m} are available from the density /. Neither 
/ nor its support set are known. 

The plug-in estimator is constructed using a data splitting approach as follows. The data is 
randomly subdivided into two parts X^r = {Xi, . . . , X^v} and Xm = {X.n+i, ■ ■ ■ , ^n+m} of 
N and M points respectively. In the first stage, a boundary compensated fc-NN density estim- 
ator ffc is estimated at the N points {Xi, . . . , Xat} using the M realizations {X^v+i, . . . , Xtv+m} 
Subsequently, the samples {Xi, . . . ,XAr} are used to approximate the functional G{f) to 
obtain the basic Bipartite Plug-In (BPI) estimator: 

1 ^ 

Giv(f.) = -5^^?(f.(X,),X,). (1) 

i=l 
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As the above estimator performs an average over the N variables Xi of the function g{f{Xi), Xi), 
which is estimated from the other M variables, this estimator can be viewed as averaging 
over the edges of a bipartite graph with N and M nodes on its left and right parts. 

2.1 Boundary compensated /c-NN density estimator 

Since the probability density / is bounded above, the observations will lie strictly on the 
interior of the support set S. However, some observations that occur close to the boundary 
of S will have fc-NN balls that intersect the boundary. This leads to significant bias in the 
fc-NN density estimator. In this section we describe a method that compensates for this bias. 
The method can be interpreted as extrapolating the location of the boundary from extreme 
points in the sample and suitably reducing the volumes of their fc-NN balls. 

Let d{X,Y) denote the Euclidean distance between points X and Y and dfc(X) denote 
the Euclidean distance between a point X and its k-th nearest neighbor amongst the M 
realizations Xat+i, .., Xtv+a/ • Define a ball with radius r centered at X: Sr{X) = {¥ : 
d{X,Y) < r}. The A;-NN region is Sfc(X) = {¥ : d{X,Y) < dfc(X)} and the volume of the 
/c-NN region is Vfc(X) = J^^f^x) "^^^ standard /c-NN density estimator ^30j is defined as 

f (X) - 

' MVfc(X)- 

If a probability density function has bounded support, the /c-NN balls Sk{X) centered at 
points X close to the boundary may intersect with the boundary S, or equivalently Sfc(X) fl 
S'^ 7^ 0, where S'^ is the complement of S. As a consequence, the /c-NN ball volume Vfc(X) 
will tend to be higher for points X close to the boundary leading to significant bias of the 
/c-NN density estimator. 

Let Rk{X) correspond to the coverage value {l+pk)k/M, i. e. , Rk{X) = inf {r : ^^-^ f{Z)dZ = 
{1+Pk)k/M}, where pk = VQ/{k^^^) for some fixed 6 G (2/3, 1). Define 

esc = Nexp{-3k^^-^^). 

Define Nk{X) as the region corresponding to the coverage value (1 + pk)k/M, i.e. Nk{X) = 
{Y : d{X,Y) < Rk{X)}. Finally, define the interior region S/ 

§j = {X e§: Nk{X) n = 0}. (2) 

We show in Appendix B that the bias of the standard /c-NN density estimate is of order 
0((A;/M)(2/'^)) for points X G S/ and is of order 0(1) at points X G § - §j. This 
motivates the following method for compensating for this bias. This compensation is done in 
two stages: (i) the set of interior points On C Xn are identified using variation in /c-nearest 
neighbor distances in Algorithm [T] (see Appendix B for details) and it is show that Jat ^ S — S/ 
with probability 1 — 0{eBc)] and (ii) the density estimator at points in "Bj^ = Xjv — Jat are 
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Figure 1: Detection of boundary points using Algorithm [T] for 2d beta distribution. 

corrected by extrapolating to the density estimates at interior points Jn that are close to 
the boundary points. We emphasize that this nonparametric correction strategy does not 
assume knowledge about the support of the density /. 

For each boundary point Xj G S^r, let X„(j) G be the interior sample point that is closest 
to Xj. The corrected density estimator is defined as follows. 

fA;(X ) = I l''^-^'^ ^-^^ ^ '^^^ (3) 
I ffc(X„(i)) {Xj G Sat} 



3 Main results 

Let Z denote an independent realization drawn from /. Also, define Z_i G S/ to be Z_i = 
argmin^.6S,ci(x,Z). Define h{X) = T^'^/'^\{d + 2)/2)f-^/^{X)tr[V\fiX))]. Denote the n-th 
partial derivative of g{x,y) wrt x by g^'^\x,y). Also, let g'{x,y) := g^^\x,y) and g"{x,y) := 
g^'^\x,y). For some fixed < e < 1, define pi = {{k — 1)/M)(1 — e)eo and pu = {{k — 
1)/M)(1 + e)eoo- Also define ei = 1/(qD'^), where 2) is the diameter of the bounded set S 
and define qi = {{k — 1)/M)ei and qu = {1 + e)eoo- Let p be a beta random variable with 
parameters k, M — k + 1. 



3.1 Assumptions 

(^.0) : Assume that M, N and T are linearly related through the proportionality constant 
dfrac with: < afrac < 1, M = afracT and = (1 — afracjT. {A.l) : Let the density / be 
uniformly bounded away from and finite on the set S, i.e., there exist constants eo, Cqo such 
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Figure 2: A;-NN balls centered around a subsample of 2d uniformly distributed points. Note that 
the original fc-NN balls centered at points close to boundary (red) over spill the boundary. The 
modified fc-NN neighborhoods (black) corresponding to the corrected corrected density estimate 
compensate for the over spill. 

that < eo < /(x) < Cqo < oo Vx G S. (^1.2): Assume that the density / has continuous 
partial derivatives of order 2v in the interior of the set S where v satisfies the condition 
{k/MY^/'^ = o(l/M), and that these derivatives are upper bounded. (^.3): Assume that 
the function g{x,y) has A partial derivatives w.r.t. x, where A satisfies the conditions = 
o(l/M) and 0{{X\{k/Mf/'^ + l/M))/M) = o(l/M). (^.4): Assume that max{6,2A} < 
k <= M. (^.5): Assume that the absolute value of the functional g{x,y) and its partial 
derivatives are strictly bounded away from oo in the range eo < x < eoo for all y. {A.6): 
Assume that sup^.g(q,_g^) \{g^'^'^ /r\f{x,y)\e-^^'-'~'^ < oo, E[sup^.g(p^^p^) \{g^'''^ /r\f{x/p,y)\] < oo, 
for r = 3, A. 

3.2 Bias and Variance 

Below the asymptotic bias and variance of the BPI estimator of general functionals of the 
density / are specified. These asymptotic forms will be used to establish a form for the 
asymptotic MSE. 
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Theorem 3.1. The bias of the BPI estimator Gfc(/) is given by 

where Cs{k,M,N) = E[l|zes-s,}(^/(/(Z_i), Z_i) - (^(/(Z), Z))] = 0{k/Mf/'', and the con- 
stants ci = E[g'{f{Z),Z)h{Z)], C2 = E[p{Z)g"{f{Z),Z)/2]. 

Theorem 3.2. The variance of the BPI estimator GAr(ffc) is given by 

Y[GNm = C4 (^] + C5 (j-) + OieBc) +0^ ^ ^ 



where the constants C4 = V[^(/(Z),Z)] and C5 = V[/(Z)^'(/(Z), Z)] . 

Proof. We briefly sketch the proof here. The above theorems have been stated more generally 
and proved in Appendix D. The principal idea here involves Taylor series expansions of the 
functional g(Jk{X),X) about the true value (?(/(X),X), and subsequently (a) using the 
moment properties of density estimates derived in Appendix A to obtain the leading terms, 
and (b) bounding the remainder term in the Taylor series and showing that it can be ignored 
in comparison to the leading terms. □ 

The leading terms ci(A;/M)^/'^ + C2//1; arise due to the bias and variance of fc-NN density 
estimates respectively (see Appendix A), while the term c^ik^ M, N) arises due to boundary 
correction (see Appendix B). Hencefo rth, we will refer to C3{k,M,N) by C3. It is shown in 



Appendix B that C3 = 0{{k/Mf''^) (130). The term 0{eBc) 

arises from a concentration 

inequality that gives the probability of the event Jat ^ § — §/ as 1 — 0{eBc)- Observe 
that if k increases logarithmically in M, specifically (log(M))^/(^~'^Y/c — )■ 0, then O^esc) = 
o{N/M^) = o(l/T). 

The term c^/N is due to approximation of the integral / g{f{x),x)f{x)dx by the sample 
mean (1/A^) Xlili fi'l/l^j)) ^j)- The term C5/M on the other hand is due to the covariance 
between density estimates f (Xj) and f (Xj), i ^ j. 

The constants C2, C4 and C5 are once again functionals of the form J g{f{x), x)f{x)dfi{x) and 
can be estimated using the proposed BPI estimator ([T|. On the other hand, the constant 
Ci requires estimation of second order partial derivatives of / in addition to estimating the 
density /. The partial derivatives might be estimated using the methods described in [38] , 
Ci could in principle be estimated in this manner. 

To estimate C3, we observe that ||Y - Y_i|| = 0{{k/MY/'^) with probability 1 - 0(iVe(A;)), 
and that Pr(Y G S - S/) = 0{{k/Mf/'^). Let /i = Y - Y_i. Then, 

C3 = E[l|Yes-s,}(^?(/(Y_i),Y_i)-^?(/(Y),Y))] 

= E[l|Yes-s,}^7'(/(Y_i), Y_i)(/(Y) - /(Y_i))] + 0{{k/Mf'') + 0(e(fc)) 
= E[l|x,es-s,}^?'(/(Y_i), Y_i) < V/(Y_i), h>]+ OHk/Mf/") + 0{e{k)). 
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The constant C3 can then be estimated as 

C3 = (1/iV) J2 9'ih{Xn(^)),^n(^)) < V/(X„(.)),X, - X„(,) >, 

where the estimate V/ of the gradient V/ of / might once again be estimated using the 
methods described in [55] . 



3.3 Central limit theorem 

In addition to the results on bias and variance shown in the previous section, it is shown here 
that the BPI estimator, appropriately normalized, weakly converges to the normal distribu- 
tion. The asymptotic behavior of the BPI estimator is studied under the following limiting 
conditions: (a) k/M — )■ 0, (b) A; — )■ 00 and (c) 00. As shorthand, the above limiting 

assumptions will be collectively denoted by A — 0. 

Theorem 3.3. The asymptotic distribution of the BPI estimator GrNiJk) is given by 
.imPrf°^(«J£dM<aUp,.(S<a). 

where S is a standard normal random variable. 

Proof. Define the random variables {YM,i', i = 1, . . . , N} for any fixed M 

(?(ffc(X,),X,)-E[<7(ffc(X,),X,)] 



MA 



Vb(ffe(X,),X,)] 



The key idea here is to recognize that YM,i are exchangeable random variables. Blum 
et.al. [5] showed that for exchangeable mean, unit variance random variables Zi, the 
sum Sat = ■^J^iLi'^i converges in distribution to iV(0, 1) if and only if Cof (Zi,Z2) = 
and Cov{Zl, Z^) = 0. In our case, 

C0V{YM,^,YM,,) = 0(1/M), 

Cov{Yl^,,Yl^^) = 0(1/M). 

As M gets large, we then have that Cov{YM,i,YM,j) — ^ and Cot;(Yf^ j, — > 0. We 
then extend the work by Blum et.al. to show that convergence in distribution to A^(0, 1) 
holds in our case as both A^ and M get large. These ideas are rigorously treated in Appendix 
E. □ 

The CLT for fc-NN estimators of Renyi entropy was alluded to by Leonenko et.al. [T7] by 
inferring from experimental results. Theorem 3.3 establishes the CLT for BPI estimators of 
arbitrary functionals, including Renyi entropy. This result allows one to define approximate 
finite sample confidence intervals on the estimated values of the functionals and define p- 
values . 
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Figure 3: Asymptotics. Variation of density estimate with increasing k and M 




Figure 4: Asymptotics. Variation of plug-in estimate with increasing k, M and N 

4 Analysis of M.S.E 



Theorem 



3. l| implies that k ^ oo and fc/M — )■ in order that the BPI estimator GAr(ffc) be 



asymptotically unbiased. Likewise, Theorem |3. 2 implies that N ^ oo and M — >■ oo in order 



that the variance of the estimator converge to 0. It is clear from Theorem 3A_ that the MSE 
is minimized when k grows in polynomially in M. Throughout this section, we assume that 
k = koM' for some r G (0, 1). This implies that 0{eBc) = 0{Ne{k)) = o(l/M) = o{l/T). 
Figures [3] and |4] illustrate the asymptotic behavior of the density estimate and the plug-in 
estimate with increasing sample size. 
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4.1 Assumptions 



Under the condition k — koM^, the assumptions {A. 2) and {A. 3) reduce to the following 
equivalent conditions: {A.2): Let the density / have continuous partial derivatives of order 
2r in the interior of the set S where r satisfies the condition 2r(l —t)/d> 1. (yi.3): Let the 
functional g{x,y) have A partial derivatives w.r.t. x, where A satisfies the conditions tX > 1. 

4.2 Optimal choice of parameters 

In this section, we obtain optimal values for k,M and N for minimum M.S.E. 
4.2.1 Optimal choice of k 

Theorems III.l and III. 2 provide an optimal choice of k that minimizes asymptotic MSE. 
Minimizing the MSE over k is equivalent to minimizing the square of the bias over k. Define 
C0 — C1 + Cs/{k/My^^. The optimal choice of k is given by 



where [x\ is the closest integer to x, and the constant ko is defined as kg — (|c2|d/2|co|)''+2 
when C0C2 > and as ko — {\c2\/\cq\)'^ when C0C2 < 0. 

Observe that the constants Cq and C2 can possibly have opposite signs. When cqC2 > 0, the 
bias evaluated at kopt is 6(]'M2+3(l+o(l)) where 6q = CQkl^'^+C2/kf). Let kfrac = k^M^ — kopt- 
When C0C2 < 0, observe that CQ{{kfj.ac-^ kopt) / M)'^/'^ + C2/ {kfrac + kopt) is equal to zero. When 
C0C2 < 0, a higher order asymptotic analysis is required to specify the bias at the optimal 
value oik. In particular, 



k^t = argminB(Gjv(ffc)) = L^oM2+dJ, 



(4) 



k 



d 




where the constants are given by 



h, = E[il/2)g"{fiY))h\X)+g\fiY))hoiY)], 
/i2 = E[(2/3)/'(/(Y))/3(Y)] 



and 



hs={l-2/d)E[g%f{Y))f{Y)c{Y)]. 
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The bias evaluated at kopt is then given by b^M'^+di^l + o(l)) where the constant = 

hlk^^ + (/l2 + C2kfrac)/ko + {h + 2Cik frac/ d)kl^'^~^ . 

Even though the optimal choice kopt depends on the unknown density / (via the constant fco), 
we observe from simulations that simply matching the rates, i.e. choosing k = k = M^/*^^+'^\ 
leads to significant MSE improvement. This is illustrated in Section [7j 



4.2.2 Choice of afrac = M/T 

Observe that the MSE of GrNiJk) is dominated by the squared bias (0(M~^/*^^"'"'^^)) as con- 
trasted to the variance {0{1/N + 1/M)). This implies that the MSE rate of convergence is 
invariant to the choice of afrac- This is corroborated by the experimental results shown in 



Fig. 12 



4.2.3 Discussion on optimal choice of k 



The optimal choice of k grows at a smaller rate as compared to the total number of samples M 
used for the density estimation step. Furthermore, the rate at which k/M grows decreases as 
the dimension d increases. This can be explained by observing that the choice of k primarily 
controls the bias of the entropy estimator. For a fixed choice of k and M [k < M), one expects 
the bias in the density estimates (and correspondingly in the estimates of the functional G{f)) 
to increase as the dimension increases. For increasing dimension an increasing number of the 
M points will be near the boundary of the support set. This in turn requires choosing a 
smaller k relative to M as the dimension d grows. 



4.3 Optimal rate of convergence 

Observe that the optimal bias decays as 6q (T2+^)(l + o(l)) when cqC2 > and 6^ (T2+3)(l + 
o(l)) when C0C2 < 0. The variance decays as 0(1/T)(1 + o(l)). 



4.4 Comparison with results by Baryshnikov etal 

Recently, Baryshnikov etal [2j have developed asymptotic convergence results for estimators 
of /-divergence G{fo, f) = J f{x)(f){fo{x)/f{x))dx for the case where /o is known. Their 
estimators are based on sums of functionals of fc-NN distances. They assume that they have 
T i.i.d realizations from the unknown density /, and that / and /o are bounded away from 
and 00 on their support. The general form of the estimator of Baryshnikov etal is given by 

1 ^ 

G^(ffe5) = ^$^^/(MX.)), 

i=l 
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where ffcs(Xj) is the standard fc-NN density estimator [31] estimated using the T — 1 samples 
{Xi,..,Xt}-{X,}. 

Baryshnikov etal do not show that their estimator is consistent and do not analyze the 
bias of their estimator. They show that the leading term in the variance is given by Ck/T 
for some constant Ck which is a function of the number of nearest neighbors k. Finally 
they show that their estimator, when suitably normalized, is asymptotically normal. In 
contrast, we assume higher order conditions on continuity of the density / and the functional 
g (see Section 3) as compared to Baryshnikov etal and provide results on bias, variance and 
asymptotic distribution of data-split /c-NN functional estimators of entropies of the form 
G{f) = J g{f{x))f{x)dx. Note that we also require the assumption that / is bounded away 
from and oo on its support. Because we are able to establish expressions on both the bias 
and variance of the BPI estimator, we are able to specify optimal choice of free parameters 
k, N, M for minimum MSE. 

For estimating the functional G{f) = J g{f{x))f{x)dx, the estimator of Baryshnikov can be 
used by restricting /q to be uniform. In Appendix C it is shown that under the additional 
assumption that {A.6) is satisfied hj g = g, the bias of GAr(fA,.5') is 

B(G;v(ffc5)) = Oiik/Ty/'') + Oil/k). (5) 

In contrast. Theorem III. 1 establishes that the bias of the BPI estimator GAr(ffc) decays 
as @{{k/Mf/'^ + l/k) + 0{eBc) and the variance decays as e(l/T). The bias of the BPI 
estimator has a higher exponent (2/(i as opposed to 1/d) and this is a direct consequence of 
using the boundary compensated density estimator in place of f^. 

It is clear from |5] that the estimator of Baryshnikov will be unbiased iff A; — )• oo as T — )■ oo. 
Furthermore, the optimal rate of growth of k is given by A; = T^/^^+''^ , Furthermore, Ck = ©(1) 
and therefore the overall optimal bias and variance of GAr(ffc5') is given by Q(T^^^^^~^'^^) and 
0(T"^) respectively. On the other hand, the optimal bias of the BPI estimator decays as 
6o (T2+d)(l + o(l)) when ciC2 > and 6~(T2+d)(l + o(l)) when ciC2 < and the optimal 
variance decays as 6(1/T). The BPI estimator therefore has faster rate of MSE conver- 
gence. Experimental MSE comparison of Baryshnikov's estimator against the proposed BPI 



estimator is shown in Fig. 12 



5 Bias correction factors 

When the density functional of interest is the Shannon entropy {g{u) = — log(u)) or the 
Renyi -a entropy(5'(ti) = a bias correction can be added to the BPI estimator that 

accelerates rate of convergence. Goria et.al. [26] and Leonenko et.al. [17] developed consistent 
Shannon and Renyi estimators with bias correction. The authors of [29] analyzed the bias 
for these estimators. When combined with the results of Baryshnikov etal, one can easily 
deduce the variance of these estimators and establish a CLT. 

Let be the Shannon entropy estimate G7v(ffcS') with the choice of functional g{x) = 



13 



— log(x). Let ia,s be the estimate of the Renyi a-integral estimate GAr(fA;S') with the choice 
of functional g{x) = Define = H5 + [log(fc — 1) — ^(fc)], where is the digamma 

function, and 1^,5' = [{T{k + (1 — a))/T{k)){k — l)"'~^]~^Ia,s- Also define the Renyi entropy 
estimator to be tla,s = (1 "~ ct)^^ log(Ia.s)- The estimators and tla,s cire the Shannon 
and Renyi entropy estimators of Goria etal [17\ and Leonenko etal [26] respectively. In [29], it 
is shown that the bias of H5 and Ia,s is given by e((fc/T)i/"'), while the variance was shown 
by Baryshnikov etal to be 0(1/T). In contrast, by (pj), the bias of H5 and Ia,s is given by 
Q((k/Ty^^ + (l/k)) ([5]). This can be understood as follows. From the results by [2H], we 

E[Us] = I - [\og{k - 1) - ^{k)] + co,o{k/Ty/' + o{{k/Tf/') (6) 

and 

E[L5] = [iX{k + (1 - a))/V{k)){k - 1)"- + coAk/Tf" + o^^k/Tf/") (7) 

for some functionals of the density Co,o and Co,o. Note that [(r(A; + (l — a))/r(A;))(A; — 1)"^^] = 
1 + 0{l/k) and = log(A; — 1) + 0{\/k) as — )■ cxd. From the above equations, the scale 
factor [{T{k+{l-a))/T{k)){k-lY~^] and the additive factor [log(A; - 1) - ^(A;)] account for 
the 0{l/k) terms in the expressions for bias of H5 and la,Si thereby removing the requirement 
that — 7- 00 for asymptotic unbiasedness. These bias corrections can be incorporated into 
the BPI estimator as follows. 



5.1 Main results 

For a general function g{x, y), if there exist functions gi{k, M) and g2{k, M), such that 

(z) E[giik - l)x/Mp, y)] = g{x, y)g^{k, M) + g^ik, M) + o(l/M), 

(n) ((A; - l)/M)ng\{k - l)x / Mp.y)^^/''-^] = g'ix,y){k/Mf/' + o{{k/Mf''), 

{in) lim gi{k,M) = 1, 

fc— >oo 

(iv) lim g2{k,M) = 0, (8) 

then define the BPI estimator with bias correction as 

^ - GAh)-g2ik,M) 



5.1.1 Bias and Variance 



In addition to the assumptions listed in section 3.1, assume that k = 0((log(M))^/'^^ '^)). Be 



low the asymptotic bias and variance of the BPI estimator with bias correction are specified. 
Theorem 5.1. The bias of the BPI estimator GN.Bci^k) is given by 

M[G^,Bcm = ci ' + C3(A;, M, iV) + o ( ( ^ ) | . (10) 
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Theorem 5.2. The variance of the BPI estimator GN,Bc{^k) is given by 



5.1.2 CLT 

Theorem 5.3. The asymptotic distribution of the BPI estimator GN^Bci^k) is given by 

where S is a standard normal random variable. 

5.1.3 MSE 

Theorem IV. 1 specifies the bias of the BPI estimator, GN,Bc{^k), as Q{(k/My^'^). Theorem 
IV. 2 specifies the variance as G(l/A^ + 1/M). By making k increase logarithmically in M, 
specifically, k = 0((log(M))^/(^"'')) for any value 6 G (2/3, 1), the MSE is given by the rate 
e(((log(T))2/(i-^)/T)^/'^). The BPI estimator therefore has a faster rate of convergence m 
comparison to both Baryshnikov etaFs estimators H5 and Ia,s (MSE = ©(T"^/^^^'^))) and 
Leonenko etaFs and Goria etaFs estimators H5 and Ia,s (MSE = Q{T~'^^^)). Experimental 
MSE comparison of Leonenko's estimator against the BPI estimator in Section V shows the 
MSE of the BPI estimator to be significantly lower. Finally, note that such bias correction 
cannot be applied for general entropy functionals, and the bias correction factors cannot in 
general be incorporated. In the next section, the application of BPI estimators for estimation 
of Shannon and Renyi entropies is illustrated. 

5.2 Shannon and Renyi entropy estimation 

For the case of Shannon entropy {g{u) = — log(M)), it can be verified that gi{k,M) = 1, 
g2{k, M) = 'i/j{k)—\og{k — l) satisfy ([s]). Similarly, for the case of Renyi entropy {g{u) = 
giik, M) = (r(A;)/r(A; + 1 - a)){l/{k - g^i^k, M) = satisfy (g. 

For Shannon entropy {g{u) = — log('u)) and Renyi entropy {g{u) = m"^^), the assumptions in 



Section 3.1 reduce to the following under the condition k = 0((log(M))^/*^^^'^)). Assumption 
(^.1) is unchanged. Assumption (^1.2) holds for any r such that 2r > d. The assumption 
(yi.3) is satisfied by the choice of A = log(M). Assumption (^1.4) holds for {g{u) = — log(M)) 
and {g{u) = Next, it will be shown that (^.5) is also satisfied by {g{u) = — log(M)) 

and {g{u) = m""^). 
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We note that g = (fi'^'^V^)^ for the choice of g{u) = — log(u) is given hj g = cu ^ for some 
constant c. Therefore, 

sup \~g{x,y)\e-^'''"-'' = \ce^%M/kfO{e-''''"-'') 

= |cer'|(M/A;)^0(e-=^('°s(M))^) 

= |cej;'^|0(e"^(^°^^^^^^'+^'°*^(*^)"®'°§^''^) = o(l), 



andby(|6g),E[sup,g(p,,p^)|^(x/p,|/)|] = |c|((l-e)eo)-6E[(Mp/(A;-l))6] = |c|((l-e)eo)-'0(l) : 
0(1). Similarly, g = (g^^y{\\)y for the choice of g{u) = — log(u) is given by ^ = A^^m"^'^. 
Then, 

sup \~g{x,y)\e-'''"''' = 0{{M/kf\-^^''-'') 

= 0((M/A;)2^e-3(i°s(^^))') 

= o(e"^('°s(M))2+2(log(A^))2-2log{A^)log(fc)^J ^ ^^^^^ 

and by g, E[sup^g(p^_p^) |^(x/p, |/)|] = 0(E[(Mp/(A; - 1))^^)] = 0(1). In an identical 
manner, (A.b) is satisfied when g{u) = u°'~^. 



To summarize, for functions g{u) = — log(u) and g{u) = u"~^, Theorem 5.1, 5.2 and 5.3 
hold under the following assumptions: (i) (A.O), (ii) (^.1), (iii) the density / has bounded 
continuous partial derivatives of order greater than d and (iv) k = 0((log(M))^/*^^~'^)). Fur- 
thermore the proposed BPI estimator GN,Bc{ik) can be used to estimate Shannon entropy 
(g(u) = - \og{u)) and Renyi entropy {g{u) = m^-^) at MSB rate of e(((log(T))2/(i-^)/r)^/'='). 



6 Estimation of Shannon Mutual information 

The joint entropy of random vectors X and Y with joint density fxY is given by 

H{X,Y) = - j fxYlog{fxY)dfi, (11) 

where fxv is the joint density of X and Y. The Shannon MI between two random vectors 
X and Y is then given by 

J(X;Y) =if(X) + if(Y)-if(X,Y). (12) 

We use the following BPI estimator to estimate Shannon MI from N + M d-dimensional i.i.d 
samples {(Xi, Yi); i = 1, . . . , + M} of the underlying joint density fxY- We estimate the 
Shannon MI by estimating the individual entropies. We estimate the joint Shannon entropy 
if (X, Y) from samples using the plug-in estimate 

1 ^ 

H(X, Y) = - ^ - log(fk(Xi, YO) + log(fc - 1) - ^{k), (13) 

i=l 
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where fxY is a nearest neighbor density estimate (fcNN) estimated using the remaining M 
samples. 

The fcNN density estimate [30] is given by 

where Vk(X, Y) is the volume corresponding to the fcth nearest neighbor distance between the 
point of density estimation (X, Y) and the M i.i.d samples {(Xi, Yi); i = + 1, . . . , N + M}. 

We estimate the marginal entropies by first obtaining estimates of the marginal density using 
fcNN density estimates 

where Vk(X) is the volume corresponding to the fcth nearest neighbor distance between the 
point of density estimation X and the M i.i.d samples {Xi; i = iV + 1, . . . , iV + M}, and then 
plugging the estimated marginals into Eq. 16 

1 ^ 

^(^) = nY.- log(fk(Xi)) + log(fc - 1) - ij{k). (16) 

i=l 

Define the BP I estimator of Shannon MI: 

li, = H(X) + H(Y) - H(X, Y). (17) 

We make the following assumptions: (i) (^1.0), (ii) (^.1), (iii) the density fxv has bounded 
continuous partial derivatives of order greater than d and (iv) k = 0((log(Af))^/*^^^^)). Note 
that the results here require cross moments between density estimates of the joint and mar- 
ginal densities, which while not discussed in this report, can be obtained in exactly the same 
manner as computing cross moments between the same density. 

Theorem 6.1. The bias of the BPI estimator is given by 

ky 



= ci(-) +Csik,M,N) + o\ (-) ). (18) 




Theorem 6.2. The variance of the BPI estimator In is given by 
where 

7x(X)/y(Y) 



Var 



log 



/xy(X,Y) 
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Figure 5: Comparison of theoretically predicted bias of BPI estimator G^vlffc) against ex- 
perimentally observed bias as a function of k. The Shannon entropy {g{u) = — log(u)) is 
estimated using the BPI estimator G^{fk) on T = 10^ i. i. d. samples drawn from the d = 3 
dimensional uniform-beta mixture density (19). N,M were fixed as = 3000, M = 7000 
respectively. The theoretically predicted bias agrees well with experimental observations. 
The predictions of our asymptotic theory therefore extend to the finite sample regime. The 
theoretically predicted optimal choice of kopt = 52 also minimizes the empirical bias. 



6.0.1 CLT 

Theorem 6.3. The asymptotic distribution of the BPI estimator 1^ is given by 



Hm Pr I ^JL^£M <a\= Pr(S < a), 



V[I 



N 



where S is a standard normal random variable. 



7 Simulations 

Here the theory established in Section 3 and Section 4 is validated. A three dimensional 
vector X = [Xi, X2, X^]'^ was generated on the unit cube according to the i.i.d. Beta plus 
i.i.d. uniform mixture model: 

3 

/(xi, X2, X3) = (1 - e) JJ fa,bi^i) + e, (19) 
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Figure 6: Comparison of theoretically predicted bias of BPI estimator GN,Bc{^k) against 
experimentally observed bias as a function of k. The Shannon entropy {g{u) = — log(u)) is 
estimated using the proposed BPI estimator GN,Bc{^k) on T = 10^ i. i. d. samples drawn 
from the d = 3 dimensional uniform-beta mixture density ( 19 ). N,M were fixed as = 3000, 
M = 7000 respectively. The empirical bias is in agreement with the bias approximations of 
Theorem IV. 1 and monotonically increases with k. 



where fa,b{x) is a univariate Beta density with shape parameters a and b. For the experiments 
the parameters were set to a = 4, 6 = 4, and e = 0.2. The Shannon entropy {g{u) = — log(M)) 
is estimated using the BPI estimators GAr(ffc) and GrN^Bci^k)- 

In Fig. [5} the bias approximations of Theorem III. 1 are compared to the empirically determ- 
ined estimator bias of GAr(ffc). and M are fixed as = 3000, M = 7000. Note that the 
theoretically predicted optimal choice of kopt = 52 minimizes the experimentally obtained 
bias curve. Thus, even though our theory is asymptotic it provides useful predictions for 
the case of finite sample size, specifying bandwidth parameters that achieve minimum bias. 
Further note that by matching rates, i.e. choosing k = k = M^/*^^"'"'^-' = 83 also results in 
significantly lower MSE when compared to choosing k arbitrarily (fc < 10 or A; > 150). In 
Fig. [6} the bias approximations of Theorem IV. 1 are compared to the empirically determined 
estimator bias of GrN^Bci^k)- Observe that the empirical bias, in agreement with the bias 
approximations of Theorem IV. 1, monotonically increases with k. 

In Fig. [jI the empirically determined variance of GAr(ffc) is compared with the variance 
expressed by Theorem III. 2 for varying choices of N and M, with fixed -|- M = 10, 000. 
The theoretically predicted variance agrees well with experimental observations. A Q-Q plot 
of the normalized BPI estimate GAr(ffc) and the standard normal distribution is shown in 
Fig.[8j The linear Q-Q plot validates the Central Limit Theorem III. 3 on the uncompensated 
BPI estimator. To verify that the predicted confidence intervals were indeed as advertised, 
the empirically determined and theoretically predicted confidence intervals were compared 
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Figure 7: Comparison of theoretically predicted variance of BPI estimator GAr(ffc) against 
experimentally observed variance as a function of M. The Shannon entropy {g{u) = — log(M)) 
is estimated using the proposed BPI estimator Gjv(ffc) on T = 10^ i. i. d. samples drawn from 
the d = 3 dimensional uniform-beta mixture density ([l9|. k is chosen to be kopt = koM'^^^'^^'^K 
The theoretically predicted variance agrees well with experimental observations. 



in Fig. 10 The lengths of the predicted confidence intervals are accurate to within 12% of 
the length of the true confidence intervals. 



We additionally show in Fig. 11 a plot of the empirically determined estimator bias (via 



simulation) vs the bias predicted by our theory as a function of sample size T, which matches 
the theoretical prediction. 

For Shannon entropy {g{u) = — log('u)), the uncompensated and compensated BPI estimators 
are related by 

GNMik) = G^(ffc) + log(A; - 1) - ^(A;). 

The variance and normalized distribution of these estimators are therefore identical. Con- 
sequently, Fig. [7| and Fig. [8] also validate Theorem IV. 2 and Theorem IV. 3 respectively. 

Finally, using the CLT, the 95% coverage intervals of the BPI estimator GN,Bc(fk) are shown 
as a function of sample size T in Fig. |9j The lengths of the predicted confidence intervals 
are accurate to within 12% of the true confidence intervals (determined by simulation over 
the range of 80% to 100% coverage - data not shown). These coverage intervals can be 
interpreted as confidence intervals on the true entropy, provided that the constants Ci, ..,05 
can be accurately estimated. 
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Figure 8: Q-Q plot comparing the quantiles of the BPI estimator G]\f(fk) (with g{u) = 
— log(u)) on the vertical axis to a standard normal population on the horizontal axis. The 
Shannon entropy {g{u) = — log(-u)) is estimated using the proposed BPI estimator G^i^k) 
on T = 10*^ i. i. d. samples drawn from the d = 3 dimensional uniform-beta mixture density 



(19). k,N,M are fixed as A; = Kpt = 52, N = 3000 and M = 7000 respectively. The 



approximate linearity of the points validates our central limit theorem 3^ 
7.1 Experimental comparison of estimators 

The Renyi a-entropy {g{u) = is estimated for a = 0.5, with the same underlying 3 

dimensional mixture of the beta and uniform densities defined above. Several estimators are 
compared: Baryshnikov's estimator 1^,5, the fc-NN estimator Ia,s of Leonenko etal [T7], the 



BPI estimator without bias correction G7v(ffc) and 
correction GN,Bc{^k)- The results are shown in Fig. 
BPI estimator GN,Bc(Jk) has the fastest rate of convergence, consistent with our theory. Note 



;he proposed BPI estimator with bias 
12 It is clear from the figure that the 



that, in agreement with our analysis in Section [4^ the bias uncompensated BPI estimator 
GAr(ffc) outperforms Baryshnikov's estimator I^^^. 



8 Application to structure discovery 

Discovering structural dependencies among random variables from a multivariate sample is 
an important task in signal processing, pattern recognition and machine learning. Based on 
dependence relationships, the density function of the variables can be modeled using factor 
graphs. When the sample is highly structured, the corresponding factor graph configuration is 
sparse. Sparse factor graphs correspond to joint multivariate distributions which separate into 
a parsimonious product of few lower dimensional distributions. The inherent low-dimensional 
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Figure 9: 95% coverage intervals of BPI estimator Gjy^BciJk), predicted using the Central 
limit theorem 3.3, as a function of sample size T. The Shannon entropy {g{u) = — log(u)) is 
estimated using the proposed BPI estimator GN,Bc{^k) on T i. i. d. samples drawn from the 
d = 3 dimensional uniform-beta mixture density (19). The lengths of the coverage intervals 



are accurate to within 12% of the empirical confidence intervals obtained from the empirical 
distribution of the BPI estimator. 
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Figure 10: Empirically determined and theoretically predicted coverage envelopes as a func- 
tion of coverage values. There is good agreement between the theoretically predicted and 
empirical coverage intervals. 

nature of this product leads to a compact representation of the variables having sparse factor 
graph configurations. 
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Figure 11: Comparison of theoretically predicted bias of BPI estimator GN,Bc{^k) against 
experimentally observed bias as a function of sample size T. The Shannon entropy {g{u) = 
— log(ti)) is estimated using the proposed BPI estimator GN,Bc{^k) on T i. i. d. samples 
drawn from the d = 3 dimensional uniform-beta mixture density (19). The empirical bias is 



in agreement with the bias approximations of Theorem IV. 1 and monotonically decreases 
with T. 



In practice, these structure dependencies have to be discovered from sample realizations of 
the multivariate distribution. Discovering dependencies when parametric probability density 
models are not known a priori is an important restriction of the above problem. For paramet- 
ric distribution estimates, the errors are of order 0(1/ N) if the true distribution is included 
in the parametric model. If not, a non- vanishing bias will dominate the error yielding an 
even higher error than that of a nonparametric distribution estimate (e.g. kNN estimates). 
In this restricted setting, recourse is therefore taken to nonparametric methods. 

Chow et.al. p] proposed an elegant solution to structure discovery of Markov tree distribu- 
tions and provided a nonparametric algorithm to obtain the optimal tree. Ihler et.al. [22] 
developed the method of nonparametric hypothesis tests for structure discovery. 

Nonparametric methods, while asymptotically consistent, can uncover incorrect factor graph 
structure when estimated from a finite number of samples. This is distinctly true for small 
sample sizes. While consistency is an important qualitative property, there is clearly an 
important motivation for quantitative characterization of performance in structure discovery. 
In this work, we analyze factor graph structure discovery in the finite sample size setting. 

We present a class of fc-nearest neighbor (/cNN) based nonparametric geometric algorithms 
to discover factor graph structure among variables. We provide results on mean square error 
of the nonparametric estimates, which can be optimized over free parameters, thereby guar- 
anteeing improved correct structure discovery. In addition, we provide confidence intervals 



23 




Figure 12: Variation of MSE of fc-nearest neighbor estimator of Leonenko etal [T7] and the 
fc-nearest neighbor estimator of Baryshnikov etal [2\ and BPI estimators with and without 
boundary correction, as a function of sample size T. The Renyi entropy {g{u) = is 
estimated for a = 0.5 using these estimators on T i. i. d. samples drawn from the d = 3 
dimensional uniform-beta mixture density (19). The figure shows that the proposed BPI 
estimator has the fastest rate of convergence. 



on these nonparametric estimates to determine the probability of false error in choosing an 
incorrect structure model. These results are an direct extension of our work on optimized 
nonparametric estimates of divergence measures introduced earlier. 

As a consequence of our statistical analysis, we introduce the notion of dependence-based 
dimension for factor graph models and show that comparing models within the same di- 
mension class is an easier task with lower probability of false error as compared to comparing 
models across different dimensions. 



8.1 Factor graphs 

Factor graphs are bipartite graphs used to represent factorizations of probability density func- 
tions. Consider a set of variables X = {Xi, X2, . . . , Xt} and let {Sj C {Xi, X2, . . . , X„}, j = 
1, . . . , m} be a set of subsets of X. Let g{Xi, . . . , X^) denote a probability density function 
on the random vector X. For the factorization g{Xi, . . . ,Xt) = YYp=i fji^j) of ^^e density 
function, the corresponding factor graph G = (X, F, E) consists of variable vertice's X , 
factor vertices's F = {/i, /2, . . . , fm}, and edges E. The edges in the factor graph depend on 
the factorization as follows: there is an undirected edge between factor vertex fj and variable 
vertex X^ when X^ C Sj. 
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8.2 Factor graph discovery 



Problem statement: Consider a set of factor graphs {gi{Xi, . . . ,XT),i = 1, . . . ,/}. We 
seek to find the factor graph configuration from this set that best models the data. 

The Kullback-Leibler (KL) divergence measure induces a geometry on the space of prob- 
ability distributions. On this induced geometry, we naturally define the best factor graph 
configuration Qo to be the one closest to the actual distribution p{Xi, . . . ,Xt) in terms of 
KL divergence (c.f. [8]). 

go = aigmin KL{p\\gi) = aigmin Hc{p, gi), (20) 
gi gi 

where Hc{p,gi) = — J ploggi is the cross-entropy between p and g^. In practice, these cross- 
entropy terms have to be estimated from the finite data sample. Errors in estimation of 
cross-entropy terms can result in incorrect factor graph discovery. 

The problem considered by [8J is a specific instance of discovering factor graph structure. 
For the class of Markov tree factor graphs considered by 0, the cross entropy reduces to 
a sum of pairwise Shannon mutual information terms between variables with edges in the 
Markov tree. In their work, they empirically estimate the mutual information terms from the 
data using nonparametric estimators which are consistent. However, they do not take into 
account the error in the mutual information estimates when estimated from finite samples. 



8.3 Disjoint factor graph discovery 

In order to illustrate the effect of nonparametric estimation from finite sample size on factor 
graph discovery, we restrict our attention to disjoint factor graphs ([22j). For i = 1, . . . ,/, 
let 

m 

g,{X,,X2,...,XT) = l[p{Sf), (21) 

i=i 

where S'j*'' fl S'j^^ = whenever j ^ k, and p{.) denotes the marginal density function. In this 
case of disjoint factor graphs, the cross-entropy takes the following simple form: 

H,ip,g^) = J2H{sf), (22) 

j 

where H{Sj^^) is the Shannon entropy of the variables S']*'* under the true distribution p. 

For example, consider the disjoint factor graph g{Xi, . . . ,^5) = p{Xi, X2)p{X3)p{X4, X^). 
The cross-entropy for this factor graph is given by Hc{p, g) = H{Xi, X2)+H{X3)+H{X4, X5). 

Consider two disjoint factor graph configurations: (a) n{Xi, . . . ,Xt) = 111=^1 fi^i) (b) 
l{Xi, . . . , Xt) = YYj^=i fi^j)- Denote the dimension of Ri by and Sj by d}j. We note 

that YlT^i d'^r^ — Yl^=i d'f^ — ^- Based on the above formulation, in order to compare the 
two potential factor graph models n and /, we need to compare the respective cross-entropy 
terms. The cross entropy test is stated below. 
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Cross entropy test: The cross entropy test to compare between models n and / is given 
by 

mi m2 

H,{p, n) - H,{p, l) = Y, ^(^^) - E ^("S",) > 0. (23) 

i=l 3=1 

We estimate these entropy terms in the test statistic Hc{p,n) — Hc{p,l) from sample realiz- 
ations using kNN plug-in estimators introduced earlier. 



8.4 Errors in factor graph discovery 

To illustrate the effect of estimation error in factor graph discovery, again consider the two 
factor graph models n{Xi, Xt) = Hil'i fi^i) and l{Xi, Xt) = Hyl^ fi^j). 



The cross entropy test (Eq. 22) between models n and / is Hc{p, n) — Hc{p, I) < 0. We replace 



this optimal cross entropy test with the following surrogate cross entropy test: 



mi m,2 



i=i j=i 

where we estimate entropy terms H{Ri) or H{Sj) using independent realizations of the 
underlying density p. To elaborate, if we have V samples {2L^^\ ■ ■ ■ ,2L^'^^} from the density 
p, we partition these V samples into mi + m2 disjoint subsets of size N + M each. This 
implies that N + M ^ V/ (mi -|- 1712). We then use each subset to estimate entropy using the 
partitioning strategy as discussed earlier. 

Denote the coefficients corresponding to the entropy estimate H{Ri) of the subset of variables 
Ri in the factor graph model n by c„.i, c„.2 and c„.4. Using the theorems established in this 
report, we have the following results: 

Mean: The mean of this surrogate test statistic is then given by 
Ep[if,(p,n) -if,(p,Z)] = H,{p,n)-H,{p,l) 



/ ^ \ VdY ' / ^ \ 2/<' 



i=l ^ ^ j=l 

nil 

+ 5^c„,2A-52%2A. (25) 
i=l j=l 

Variance: The variance of the surrogate test statistic is then given by the sum of the variance 
of the individual entropy estimates (by independence) 



mi m2 



Y,[H,{p, n) - H,{p, I)] = [Yl + Y ^'.4 ) ( ^ ) • (26) 

. i=l i=i 
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Weak convergence: Again, by independence of the individual entropy estimates, we have 
the following weak convergence law 

where Z is standard normal. 



8.5 Discussion 

From the above expressions for the mean, variance and weak convergence law of the surrogate 
test statistic, we make the following observations: 

1. The bias term is dependent on the dimension of the factors of the factor graph models 
dj-"^ and d!j\ The variance term is independent of dimension. Furthermore, it is clear 
that the bias term dominates the MSE as the dimension of the factors grows. 

2. For better performance in discovering factor graph structure using cross entropy tests, 
it is clear that we want the MSE of the surrogate test statistic to be small. A significant 
route to achieving this is to get the bias from each factor graph cross entropy estimate 
in the estimated test statistic to cancel. This is to say, we want 

¥.p[H,{p,n) - Hc{p,l)] ^ H,{p,n)-H,{p,l) 
^Ep[H,{p,n)]-H,{p,n) ^ Ep[H,{p,l)] - H,{p,l) 

"ii / ^ \ ' rni / k\ ^^'^^'^ 

^E^"«Mm) «^ E^'.MmJ (28) 

i=l ^ ^ i=l j=l ^ ^ j=l 

3. This cancellation effect will be maximized when the dimensions of the factor graph 
subsets Ri and Sj match. That is to say, we want mi = m2 and furthermore d^"'' = aj\ 
In this case, the bias from each cross entropy estimate are of the same order and will 
nearly cancel. 

On the other hand, when there is a mismatch in dimension, the bias from one cross 
entropy estimate will dominate the bias from the other cross entropy estimate, resulting 
in significant bias in the surrogate test statistic. 

In both these cases, the variance of the surrogate test statistic will be of the same order 
0{1/N). 

4. This gives rise to notion of multivariate dimension for factor graphs. Index the fac- 
torizations according to the vector E = [ei, 62, e^], where Cj is an integer between 
and T that counts the number of factors of order i, i.e. involving a marginal density 
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over i variables. The dimension E of factor graph configurations partitions the factor 
graphs into equivalence classes having nearly constant cross entropy estimate bias. 

For two factor graph models n and / with dimensions En and Ei, we will refer to n as a 
higher dimensional model relative to / if the last non-zero entry of E^ — Ei is positive. 

5. As discussed earlier, the bias will not be a significant factor when comparing models 
over an equivalence class having fixed values of E. On the other hand, the bias will 
be significant when comparing models across different values of E, resulting in higher 
probability of error in factor graph discovery. 

6. Prior knowledge of the equivalence class will therefore translate into much improved 
performance in factor graph discovery as compared to prior knowledge that mixes 
between equivalence classes. 

7. We note that the number of samples required to maintain a constant level of bias grows 
geometrically with dimension E. 

8. Using the expressions for the bias and variance of the surrrogate test statistic, we can 
optimize over the free parameters: (a) the choice of partition N and M for fixed total 
sample size N + M and (b) the choice of bandwidth parameter k, for minimum MSE. 

9. Using the weak convergence law, we can theoretically predict the probability of choosing 
model n over model I using the surrogate cross entropy test. 

8.6 Experiment 

We illustrate the implications of our analysis with a toy example. Let fi3{x, a, b, d) denote a 
beta density of dimension d with parameters a and b. Now let f^{x,d) = 0.5fi3{x,5,2,d) + 
0.5//3(x, 2, 5, d) be a mixture of beta densities. When d > 1, the mixing of densities ensures 
there is strong dependence between the variates. 

We draw V = 10^ independent sample realizations from the joint density p{Xi, . . . ,Xr,) = 
/^(Xi, 1)UX2, l)^(X3, 1)^(X4, Xs, 2). 





E 


True 


False 


1 


[1,0,0,1,0] 


/(Xi,X2,X4,X5)/(X3) 


/(Xi,X2,X3,X4)/(X5) 


m 


[1,2,0,0,0] 


f(Xi.X2)f{X^,X,)f(X^) 


/(Xi,X3)/(X2,X4)/(X5) 


n 


[3, 1,0,0,0] 


f{X,.X-,)f{Xi)f{X2)f{X:,) 


f(X2.xo.r{x,)f{x,)f{X5) 



Experiment The table above shows six different factor graph models. We compare each 
true model against each false model. Denote the true models by It, ttit and ht and the 
corresponding false models by Ip, mp and np- We note that the true cross entropy terms 
Hc{p,h) = Hc{p,mT) = Hc{p,nT) and Hc{p,Il) = Hc{p,mL) = Hc{p,nL)- This guaran- 
tees level playing field when comparing each true model against each false model using the 
surrogate cross entropy test. 
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Figure 13: True factor graph representation of the 5-dimensional joint density p{Xi, . . . ,X^) = 

MXi, i)U{X2, i)U{X3, i)^(X4, X5, 2). 

For the surrogate cross entropy test, we set A^ = .2*10^, M = .8*10^ and k = 20. We note 
that the maximum value of mi + m2 for the above set of tests is 8 and that V/8 > {N + M). 
This choice of N and M therefore ensures that there are enough samples V to guarantee 
sufficient number of independent samples for estimating individual entropies (see Section 5). 

The table below lists the probability (experimental/theoretical predictioiij^ of choosing the 
false model over the true model for the various tests. 



Same true vs Same false 


It vs Ip 


niT vs mp 


np vs np 


Error (Exp/Theor) 


0.071/0.032 


0.067/0.066 


0.068/0.028 


High true vs Low false 


It vs mp 


It vs np 


mp vs np 


Error (Exp/Theor) 


0/0 


0/0 


0/0 


Low true vs High false 


rriT vs Ip 


np vslp 


np vs mp 


Error (Exp/Theor) 


0.689/0.732 


0.995/1.000 


0.691/0.665 



Explanation For the class of models above, the set of constants {c„-i,q^.i} are always 
negative. As a result, when comparing a high dimensional model to a low dimensional 
model, the additional bias will strongly tilt the test statistic towards the higher dimensional 
model. As a result, there is a greater chance of detecting the higher dimension model in the 
surrogate cross entropy test, irrespective of whether the higher dimensional model is true or 
false. 

To elaborate, when the high dimensional model is true and the low dimensional model is 
false, the bias will further tilt the test statistic towards the high dimensional model, resulting 
in zero false detections. On the other hand, when the low dimensional model is true, the 
bias in the surrogate test statistic deviates towards the high dimensional model, resulting in 

^The theoretical prediction requires estimation of constants c/.i,Q;2 and ci-^. These constants were es- 
timated from the data using oracle Monte Carlo methods which utilized the true form of the density p. In 
practice, when the true form of p is never known, we adopt methods given by JSS^ to estimate these constants 
from data. 
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a high number of false detections. When we compare factor graph models within the same 
class of dimension, the bias from the cross entropy estimates for each model nearly cancel, 
resulting in a surrogate test statistic with much smaller bias as compared to the above two 
cases. As a result, the number of false detections is correspondingly low when comparing 
models within the same dimension. 

By the same argument, for factor graph models where the set of constants {c„-i,c/.i} are 
positive, we can conclude that the surrogate test statistic will be biased towards lower di- 
mensional models. 

9 Application to intrinsic dimension estimation 

In this work we introduce a new dimensionality estimator that is based on fluctuations of the 
sizes of nearest neighbor balls centered at a subset of the data points. In this respect it is sim- 
ilar to Costa's fc-nearest neighbor (kNN) graph dimension estimator P] and to Farahmand's 
dimension estimator based on nearest neighbor distances [13]. The estimator can also be re- 
lated to the Leonenko's Renyi entropy estimator [27]. However, unlike these estimators, our 
new dimension estimator is derived directly from a mean squared error (M.S.E.) optimality 
condition for partitioned kNN estimators of multivariate density functionals. This guaran- 
tees that our estimator has the best possible M.S.E. convergence rate among estimators in 
its class. Empirical experiments are presented that show that this asymptotic optimality 
translates into improved performance in the finite sample regime. 

9.1 Problem formulation 

Let 'y = {Yi, . . . , Yt} be T independent and identically distributed sample realizations in 
M.^ distributed according to density /. Assume the random vectors in are constrained 
to lie on a d-dimensional Riemannian submanifold S of {d < D). We are interested in 
estimating the intrinsic dimension d. 

9.2 Log-length statistics 

Let 7 > be any arbitrary number and a = '~f/d. Partition the T samples in ^ into two 
disjoint sets X and Z of size [T/2J each. Denote the samples of X as X = {Xi, . . . ,^\t/2\} 
and Z as Z = {Zi, . . . , Z\t/2\}- 

Partition X into 'target' and M 'reference' samples {Xi, . . . ,XAr} and {Xat+i, . . . ,X.\t/2\} 
respectively with N + M = [T/2J. Partition Z in an identical manner. Now consider the 
following statistics based on the partitioning of sample space: 

N 
i=l 
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where Rk(Xi) is the Euchdean k nearest neighbor (fcNN) distance from the target sample Xj 
to the M reference samples {Xat+i, . . . ,X.\t/2\} ■ This partitioning of samples is illustrated 
in Fig. 14 



o N target samples 
X M reference samples 
— kNN edges 




Figure 14: kNN edges on sphere manifold with uniform distribution for d = 2, D = 3, and 
k = 5. 



9.3 Relation to /cNN density estimates 

Under the condition that k/M is small, the Euclidean fcNN distance Rk(Xj) approximates 
the kNN distance on the submanifold S. The kNN density estimate [31] of / at Xj based on 
the M samples Xat+i, . . . , Xat+a/ is then given by 



M QRk(X,)'^ M Vk(XO^ 
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where q is the volume of the unit ball in d dimensions and therefore Vk(Xi) is the volume 
of the fcNN ball. This implies that Lk(3C) can be rewritten as follows: 



N 



i=l 

N 



alog(A; — 1) - 
-«log(QM). 



a 
N 



i=l 



(29) 



As eq. (29) indicates, the log-length statistics is linear with respect to \og{k — 1) with a slope 
of a. This prompts the idea of estimating a (and later d) from the slope of Lk(X) as a 
function of log(fc — 1). 



9.4 Intrinsic dimension estimate based on varying bandwidth k 

Let ki and be two different choices of bandwidth parameters. Let Lki(X) and Li^^lZ) be 
the length statistics evaluated at bandwidths ki and /c2 using data X and Z respectively. A 
natural choice for the estimate of a would then be 



a 



Lk2(2') ~ Lki(X) 
log(A;2 - 1) - log(A;i - 1) 

N 

" + ^EO°gfk.(Z.)-logfk,(X, 



i=l 



a + z/(Ek,(Z)-Ek,(X)), 



where 



1 ^ 



N 
1=1 



and u = — a/log((fc2 — l)/(^i ~ !))• The intrinsic dimension estimate is related to a by the 
simple relation d = 7/a. 
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9.5 Statistical properties of intrinsic dimension estimate 



We can relate the error in estimation of a to the error in dimension estimation as follows: 



d-d = 



7 

= 7 



1 1 
a a 
a — a 



^ [a — a) -\- oia — a). 



Define k = —'yvja^. We recognize that the density functional estimate Ek(X) is in the form 
of the plug-in estimators introduced in this report. Using the results on the bias, variance and 
asymptotic distribution of the density functional estimate Ek(X) established in this report 
and the above relation between the errors A — d and a — a, we then have the following 
statistical properties for the estimate d: 

Estimator bias 



E[d] — d = KCbi 



+ o 




Estimator Vciriance 



V(d) = Ik^c^ 



1 

M 



1 

iV 



Central limit theorem 

Let Z be a standard normal random variable. Then, 



„ . d-Efdl 
lim fr . , 



< a = Pr(Z < a). 



9.6 Optimal selection of parameters 

We have theoretical expressions for the mean square error (M.S.E) of the dimension estimate 
d, which we can optimize over the free parameters /ci, A;2, N and M. We restrict our attention 



33 



to the case k2 — 2k; ki — k. The M.S.E. of d (ignoring higher order terms) is given by 

M.S.E.(d) = (E[d] -d)2 + V[d] 



M 



+ C. (^) . (30) 
where Cb, = k2(2/'^-i), C^^ = k/A and = 2k'^c^. 

Optimal choice of bandwidth 

The optimal value of k w.r.t the M.S.E. is given by 

kopt = L^oM^J. (31) 
where the constant ko = (|C(,2|<i/2|CbJ)*f2. 

Optimal partitioning of sample space 

Under the constraint that N + M = lT/2\ is fixed, the optimal choice of as a function of 
M is then given by 

6+d 

N,,t^lNoMW\, (32) 

where the constant A^o ~ 

9.7 Improved estimator based on correlated error 

Consider the following alternative estimator for a: 

Li^^{X) — Lki(X) 

Oi — 

log(A;2 - 1) - log(fci - 1) 
= a + «:(Ek,(X)-Ek,(X)), 

and the corresponding density estimate d which satisfies 

d — d — -(a — a) + o(a — a), 

where both the length statistics at bandwidths ki and /c2 are evaluated using the same 
sample X. The density functional estimates Ekj^(X) and Ek2(X) will be highly correlated (as 
compared to the independent quantities Eki(X) and Ek2(2.)). This implies that the variance 
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Figure 15: Comparison of theoretically predicted and experimental M.S.E. for varying choices 
of k. The experimental performance of the estimator d is in excellent agreement with the 
theoretical expression and, as predicted by our theory, the modified estimator d significantly 
outperforms d. 



of the difference Ek2(X) — Eki(X) will be smaller when compared to E\^^{Z) — Ek^(X), (while 
the expectation remains the same). 

Since the estimator bias is unaffected by this modification, the variance reduction suggests 
that d will be an improved estimator as compared to d in terms of M.S.E.. In order to obtain 
statisti cal p roperties for the improved estimator d (equivalent to the properties developed in 
Section 



9.5 



for the original estimator d), we need to analyze the joint distribution between 
fki(^i) and fk2(Xj) for two distinct values ki and k2. Our theory, at present, cannot address 
the case of distinct bandwidths ki and k2. 

Since the estimate d has smaller M.S.E. compared to d, M.S.E. predictions for the estimate 
d can serve as upper bounds on the M.S.E. performance of the improved estimate d. 



9.8 Simulations 

We generate T = 10^ samples "B drawn from a. d = 2 mixture density fm = -SfjB + .2/^, where 
//3 is the product of two 1 dimensional marginal beta distributions with parameters a = 2, 
P = 2 and is a uniform density in 2 dimensions. These samples are then projected to a 
3-dimensional hyperplane in by applying the transformation ^ = WB where [/ is a 3 x 2 
random matrix whose columns are orthonormal. We apply our intrinsic dimension estimates 
on the samples 
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Figure 16: Comparison of theoretically predicted and experimental M.S.E. for varying choices 
of M. The experimental performance of the estimator d is in excellent agreement with the 
theoretical expression and, as predicted by our theory, the modified estimator d significantly 
outperforms d. 

Optimal selection of free parameters 

In our first experiment, we theoretically compute the optimal choice of k for a fixed partition 
with M = 3.5 X 10^ and = 1.5 x 10^. We then show the variation of the theoretical 
and experimental M.S.E. of the estimate d and the experimental M.S.E. of the improved 



estimate d with changing bandwidth k in Fig. 15 In our second experiment, we compute 



the optimal partition according to eq. (32) and show the variation of M.S.E. with varying 



choices of partition in Fig. 16 



From our experiments, we see that there is good agreement between our theory and sim- 
ulations. As a consequence, we find the theoretically predicted optimal choices of k,Naiid 
M to minimize the observed M.S.E.. In addition, as predicted by our theory, the modified 
estimator d significantly outperforms d. The theoretically predicted M.S.E. for d therefore 
serves as a strict upper bound for the M.S.E. of the improved estimator d. 



Comparison of dimension estimation methods 

We compare the performance of our proposed dimension estimators to the estimated proposed 
by Frahmand et. al. |14j (denote as df) and Costa et. al. [9] (denote as dj). 

Expressions for the optimal bandwidth k (eq. (|4])) and partition N,M (eq. (32)) depend on 
the unknown intrinsic dimension d and constants C;,^, Cb^ and which depend on unknown 
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10 



— d (Original) 

— d (Improved) 

— df (Faraliniand) 

— dj (Costa) 



10 



X 10 



Figure 17: Comparison of performance of dimension estimates (Solid line: Optimal (optimal 
choice of k,N and M as per eq. ^ and eq. (32)); Dashed line: Suboptimal (fixed k = 20, 
N = T/50, M = [T/2J — A^)): The proposed improved kNN distance estimator outperforms 
all other estimators considered. 



density /. The constants Cb^, c^j and Cy can be estimated from the data using plug-in 
methods similar to the ones used by Raykar et. al. [38j for optimal bandwidth selection for 
kernel density estimation . To establish the potential advantages of our dimension estimators 
we compare an omniscient optimal form of our estimator, for which the true values of these 
constants are known, to a suboptimal form of our estimator that does not know the constants. 

For the optimal estimator, we theoretically compute the optimal choice for k, N and M 
for different choices of total sample size T (sub-sampled from the initial 10^ samples), and 
use these optimal parameters for the estimators d and d. We use this optimal choice of 
bandwidth k for the estimators dj and d^ as well (partitioning not applicable). For the 
suboptimal estimator, we arbitrarily choose the parameters as follows: fixed k = 20, N = 
T/50, M = [T/2J - N. 



The performance of these estimators as a function of sample size T is shown in Fig. 17 
Estimators with optimal choice of parameters are indicated in solid line, and the suboptimal 
estimators are indicated in dashed lines. 

From our experiments we see that the performance of the original estimator d with sub- 
optimal choice of parameters is marginally inferior when compared to the estimator with 
optimal choice of parameters. This does not hold for the other estimators as can be expected 
since the parameters are optimized w.r.t. the performance of d. 

We note that the improved estimator d outperforms all other estimators while the perform- 
ance of our original estimator d is sandwiched between d/ and dj. We conjecture that the 
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Figure 18: Comparison of performance of dimension estimates for anomaly detection in 
Abilene network data. 

performance of dj is superior to d for the same reason that d outperforms d: correlated error 
between different length statistics. 



Anomaly detection in Abilene network data 

Anomalies can be detected in router netowrks by estimating the local dimension at each time 
point and monitoring change in dimension. The data used is the number of packets sent by 
each of the 11 routers on the abiline network between January 1-2, 2005. A sample is taken 
every 5 minutes, leading to 576 samples with an extrinsic dimension pf 11. 



The performance of different dimension estimators is shown in Fig. [T8j We know that sim- 
ulataneous peaks in router traffic should imply strong correlation between the routers and 
therefore lower intrinsic dimension. This behaviour is clearly reflected better by the optimized 
estimator as compared to the estimator of Costa et. al. [0] and Levina and Bickel 



10 Conclusion 

A new class of boundary compensated bipartite k-NN density plug-in estimators was pro- 
posed for estimation of smooth non-linear functionals of densities that are strictly bounded 
strictly away from on their finite support. These estimators, called bipartite plug-in (BPI) 
estimators, correct for bias due to boundary effects and outperform previous fc-NN entropy 
estimators in terms of MSE convergence rate. Expressions for asymptotic bias and variance 
of the estimator were derived estimator in terms of the sample size, the dimension of the 
samples and the underlying probability distribution. In addition, a central limit theorem 
was developed for the proposed BPI estimators. The accuracy of these asymptotic results 
were validated through simulation and it was established that the theory can be used to 
specify optimal finite sample estimator tuning parameters such as bandwidth and optimal 



38 



partitioning of data samples. 

Our theory has two important by-products: (1) We estabhshed similarity between the mo- 
ments of A;-NN density estimates and kernel density estimates. This in turn implies that 
plug-in estimators based on k-NN density estimators and kernel density estimators have 
asymptotically equal rates of convergence. (2) We developed an algorithm for detection and 
correction of density estimates at boundary points for densities with finite support. This 
correction helps reduce the bias of density estimates at the boundaries of the support of the 
density, thereby reducing the overall bias of the plug-in estimators. 

Using the theory presented in the paper, one can tune the parameters of the plug-in estimator 
to achieve minimum asymptotic estimation MSE. Furthermore, the theory can be used to 
specify the minimum necessary sample size required to obtain requisite accuracy. This in 
turn can be used to predict and optimize performance in applications like structure discovery 
in graphical models and dimension estimation for support sets of low intrinsic dimension. 
We applied our theory to the problem of estimating Shannon entropy and Shannon mutual 
information. Furthermore, we used the Shannon entropy estimator to discover structure in 
high dimensional data and to determine the intrinsic dimension of data samples. 
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For the reader's convenience, the notation used in this paper is hsted in the table below. 



Notation 


Description 






BPI estimator 






BPI Gstimator with bias compensaition (joj) 


gi{k,M),g2{k,M) 


Bias correction factors 




S 


SuDDort of dpnsitv f 




a 


HimpTisioTi of siinnort r5 




Cd 


unit ball volume in d dimensions 


r V 


"V" "VT" rz 1 

, ...,Xt,Y,Z} 


T + 2 independent realizations drawn from / 






iXi X/v) 




Xm 


i X. X /vro- A/f r 




Q - 
"/ 


Interior of support 




Tax 


Interior points subset of X/v^ 






Boundary points subset of Xat 






Closest interior point to Z; Z_i = argmiUajgS/ Z) 






-^n(i) £ ^3Nf is the interior sample point that is closest to Xj € "Bat 







Constant; b G (2/3,1) 






Probability of misclassification of a; G S — §/ as interior point 




) 


A:-NN ball radius 






/t-NN ball 




VhiX) 


/c-NN ball volume 




P(X) 


Coverage function 




h(X) 

'■k ) 


A:-NN density estimate 




ii-ix) 

'■k V^*- 1 


Boundary corrected A;-NN density estimate 




^("Vx, v) 


n-th derivative of y) wrt x 




P 


beta random variable with parameters k, M — k + 1 






Proportionality constant; M = ajracT and = (1 — afrac)T 




eO) Coo 


constants such that eo < /(x) < eoo Vx G S 




2v 


Number of times / is assumed to be differentiable 




A 


Number of times g{x, y) is assumed to be differentiable wrt x 




Ci, ..,C5 


Constants appearing in Theorems III.l, III. 2, III. 3 and IV. 1, IV. 2, IV.3 




e(A;) 


Function which satisfies the rate of decay condition &{k) = 0{e~^^^^ 




ku 


kM = {k- 1)/M 






The event P(X) > (1 - pk)kM 






The event P(X) < (1 + Pk)kM 






The event {I - pk)kM < P{X) < {1 + pk)kM 




efe(X) 


Error function ek{X) = ffc(X) - E[fk{X) \ X] 




e(X) 


Error function e{X) = ik{X) - E[ffc(X) | X] 
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Appendices 



A Uniform kernel density estimation 

Throughout this section, we will derive results on moments of the uniform kernel density 
estimates for points in the set S' = {X : Su(X) C S}. This definition implies that the 
density / has continuous partial derivatives of order 2r in the uniform ball neighborhood for 
each X E S' where r satisfies the condition 2r(l — t)/d> 1. This excludes the set of points 
close to the boundary of the support, where the continuity assumption of the density is not 
satisfied. We will deal with these points in Appendix C. 

Let Xi, .., Xm denote M i.i.d realizations of the density f. We will assume that / is continu- 
ously differentiable evrywhere in the interior of the sWe seek to estimate the density at X 
from the M i.i.d realizations Xi, ..,Xm- Let q denote the volume of a unit hyper-sphere in 
d dimensions. The uniform kernel density estimator is defined as follows: 

A.l Uniform kernel density estimator 

The uniform kernel density estimator is defined below. The volume of the uniform kernel is 
given by 



Vu{X) 



k 



(33) 



and the kernel region is given by 




(34) 



lu(X) denotes the number of points faUing in Su{X) 



lu(^) - ^i=i'i-x,eSu(x), 



(35) 



and the uniform kernel density estimator is defined by 



fu(^) 



lu(^) 



(36) 



Mv^ixy 



The coverage of the uniform kernel is defined as 




(37) 
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We observe that lu(-^) is a binomial random variable with parameters M and U (X). Figure 
19 illustrates the uniform kernel density estimate. 




Figure 19: Uniform kernel density estimator. 



A. 2 Taylor series expansion of coverage 

We assume that the density / has continuous partial derivatives of third order 

in a neighborhood of X. For small volumes Vu{X) (which is equivalent to the condition that 
k/M is small), we can represent the coverage function U{X) by using a third order Taylor 
series expansion of / about about X |3T] . 

U{X) = [ f{Z)dZ 

JSu{X) 

= /(X)K(X) + c{X)V^'^'/'{X) + o{V^'+'/'{X)) 
k / k\ / / k\ 
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where c{X) = r(''/^\^^)tr[W^{f{X))]. 



A. 3 Concentration inequalities for uniform kernel density 

Because lu(-^) is a binomial random variable, we can apply standard Chernoff inequalities to 
obtain concentration bounds on the density estimate, lul-'^) is a binomial random variable 
with parameters M and U (X) . 



A. 3.1 Concentration ciround true density 

For <p < 1/2, 

Pr(lu(X) > (1 + p)MU{X)) < e-^^(^)fV4^ (39) 

and 

Pr(lu(X) < (1 - p)MU{X)) < e-^^WfV4_ (49) 
Using the Taylor expansion of coverage, we then have 

PriW > (1 + 0{{k/Mf'))) <~ e-^'^^W/^ (41) 

and 

Pr(fu(X) < {l-p){f{X) + 0{{k/Mf/'^))) <- e-f'^^^W/^ (42) 
This then implies that 

Pr(fu(X) > (1 + p)f{X)) <^ e-^'^•^W/^ (43) 

and 

PriUX) < (1 - p)f{X)) <~ e-^'^^W/^ (44) 

Let X be a random variable with density / independent of the M i.i.d realizations Xi, .., Xm- 
Then, 

Pr(fu(X) > (1 +p)/(X)) = Ex[Pr(fu(X) > (1 +p)/(X))] 



~ e 



p2fc/(x)/4^ 



and 



~ e-^''^/^ (45) 



Pr(f„(X) < (1 -p)/(X)) = Ex[Pr(fu(X) < (1 -p)/(X))] 

< Eh (e-P''=^(^)/4)] 

^ ^e-P"''/\ (46) 
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A. 3. 2 Concentration away from 

We can also bound the density estimate away from as follows: 

Pr(fu(X) = 0) = Ex[Pr(f,(X) = 0] 

= E[(l - [/(X))^] 

= m-{kf{X) + o{k)/M)'^] 

= E[((l - {kf{X) + o(A;)/M)^/('=^W+''('=)))^^W+''«] 

= ~e-^ (47) 



A. 4 Central Moments 

Define the error function of the uniform kernel density, 

e^{X) - f„(X)-E[f,(X)]. (48) 
The probability mass function of the binomial random variable lu(-'^) is given by 

PriUX) = Q = {^^ {U{X)Y^{1 - U{X)y 



\M-L 



Since lu(-^) is a binomial random variable, we can easily obtain moments of the uniform 
kernel density estimate. These are listed below. 

First Moment: 

E[fuWl-/(X) = ^U(X)-nx) 



Second Moment: 



V[f„(X)] = E[e^(X)] 
M 



^^U{X){1-U{X)) 
/(X)l + o(i). (50) 



Higher Moments: For any integer r > 3, 



ne'uiX)] = O(^). (51) 
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A. 5 Covariance 



Let X and Y be two distinct points. Clearly the density estimates at X and Y are not 
independent. We expect the density estimates to have positive covariance if X and Y are 



close and have negative covariance if X and Y are far. This is illustrated in Figure 20 



Original Samples 



Additional Sample: 
IHeTised h 

■ Heviaed /, 

■ Eevieed J, 
■Original /i 

■ Original ji 
■Original 



Figure 20: Covariance between uniform kernel density estimates. 

Observe that the uniform kernels are disjoint for the set of points given by \1'„ := {X, Y} : 
||X — y|| > 2(fc/QM)^/'^, and have finite intersection on the complement of \E'„. Indeed 
we will show that when the uniform balls intersect (and therefore X and Y are close), the 
density estimates have positive covariance and that they have negative covariance when the 



uniform kernels are disjoint. Intersecting and disjoint balls are illustrated in Figure [21 
Define, 

f/(X,F):=E[lze5.(x)W.m]- (52) 



Intersecting balls 

Lemma A.l. For a fixed pair of points {X, Y} G '^u, 
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M samples 

Points of Density Estimation 
Intersecting balls 
Disjoint ball 



Figure 21: Intersecting and disjoint balls. 



Proof. For {X, F} G we have that lz6S'„(x)lzeSu(y) = and therefore f/(X, F) = 0. 
We then have, 



Cov[e^{X),e^{Y)] 



E[(f,(X) - E[f,(X)])(f„(F) - E[f,(F)])] 
M 
k 

M 
k 

M 



,E[(1 U{X)){1 zeSu{Y) - U{Y))] 



^^-E[UeSAX)lzes.(Y)-U{X)U{Y)] 



k^ 



{U{X,Y)-U{X)U{Y)) 



-^[U{XMY)] 



-f{X)f{Y) f 1 
M \M 



□ 



Disjoint balls For {X, y} G there is no closed form expression for the covariance. 
However we have the following lemmas: 
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Let Ru{X) and Ru{y) denote the (constant and equal) radii of the uniform balls respectively. 
Define - Y\\/R^{X)) = V{Su{X) n Su{Y))/Vu{X) where V{Su{X) n Su{Y)) is the 

volume of the intersection of the two balls. 

We observe that, 



m\X-Y\\/R,{X)) = V{S^{X)nSu{Y))/V^{X) 

^ y['^ZeB{0,RuiX))'i-Z&Bi\\Y-X\\,Ru{Y))] 

Vu{X) 

^ ^[lzeB(oj)lzgB(||y-x||/R^(x),i)] 

= 0(1). (53) 
Because / is assumed to be continuous, we have 

U{X.Y) = E[lze5„(x)lze54y)] = [f{X) + o{l)]V{S^{X)nS^{Y)). (54) 
Lemma A. 2. For a fixed pair of points {X, Y} e ^y!^, 



Cov[e^{X),e^{Y)]^0{l/k). 



Proof. 



2^(^'^) = T:^[f{x) + o{i)]v{Su{x)ns,{Y)) 



f{X) + o{l)V{Bxr\BY) 

k Vu{X) 
f{X) + 0(1) 

k 

f{X) 



mX-Y\\/Ry{X)) 



k 

0{l/k) 



mX-Y\\/Ry{X)) + o{l/k) 



Therefore, 



Cov[e^{X),e^{Y)] = E[(f,(X) - E[f„(X)])(f„(r) - E[f,(r)])] 

= ^mX,Y)-U{X)U{Y)) 

M M 
- ^U{X,Y)--UiX)U{Y) 

= o{i/k) -e{i/M) 

= 0{l/k). 
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□ 

Lemma A. 3. 

/ UiX,y)dy^[fiX) + oil)]V^iXf. 

Proof. We note that for U{X,y) ^ 0, we need {X,y} e We therefore have, f{y) — 
f{X)+o{l). 

ju{X,y)dy - j[f{X) + o{l)]V{Su{X)nS^{Y))dy 

= K(x)[/(x) + o(i)] j mx-y\\/Ru{x))dy 

= v^{x)[f{x) + o{i)]K{xY j my\\IRu{x))d{y/Ru{x)) 

= KW[/(X)+o(l)]^ U{\\y\\lR^{X))d{y/R^{X)) 

Cd J 

- [/(X) + o(l)]^^ / md{5). 

Cd J 

The integral J ^!:((5)ci((5) can be shown to be equal to ca for all dimensions d. 
We then have, 

fu{X,y)dy = [f{X) + o{l)]V^{X) 

= 0(1)1 (Ay. 

□ 

Lemma A. 4. Let 71 (^), 12{X) he arbitrary continuous functions. Let Xi,..,Xm,X,Y 
denote M + 2 i.i.d realizations of the density f. Then, 

Co.(7:(X)e„(X),,.(Y)e„(Y)l = CovMX)f(X)M^Wn ^ 
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Proof. 

Co^;[7i(X)eu(X),72(Y)eu(Y)] = E 7i(X)72(Y)(f„(X) - E[f„(X)])(f„(y) - E[f„(y)]) 

^ E[7i(X)72(Y)([/(X, Y) - f/(X)[/(Y))] 



MK(^)K(1^) 
^^^^^E[7i(X)72(Y)[/(X,Y)] 

^^^^^E[7i(X)72(Y)C/(X)C/(Y)] 
7-77. 



= ^ (E[7i(X)/(X)]E[72(Y)/(Y)]) . 



-E[7i(X)72(Y)C/(X,Y)] 



MVS{X) 



MV^(X) 

Now for U{x,y) ^ 0, we need {a;, |/} G We therefore have, ■J2{y)f{y) = 72(a;)/(a;) + 
We then have, 



1 bi{^h2{x)f{x) + o{l)] (^j U{x,y)dy^ dx 



^ j [7i(a;)72(x)f (x) + o(l)] ((/(x) + o{l))V^{xf) dx 

[ll{x)^2{x)f{x) + 0{l)]{f{x) + 0(l))dx 

(E[7i(X)72(X)/^(X)]+o(l)) 



1 

M 
1 

M 

^E[7i(X)72(X)f (X)]+o(l/M). 



□ 



A. 6 Higher cross moments 

Disjoint balls We have the following results concerning higher cross moments for disjoint 
balls: 
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Lemma A. 5. Let q,r be positive integers satisfying q + r > 2. For a fixed pair of points 



CovieliXleliY)) = o(l/M). 

Proof. For a fixed pair of points {X, Y} G ^u'^, the joint probability mass function of the 
functions lu{X),lu{Y) is given by 



V'l;) I'D J 

We also have from chernoff inequalities for binomial random variables that 

Pr((l - p)k < \^{X) < (1 + p)k) = 1 - e-P'^ 
Pr((l - p)k < lu(F) < (1 + p)k) = 1 - e-f'^ 

Denote the high probability event x by (1 — p)k < lu{X),lu{Y) < (1 +p)k. Define iu(-'^), 
lu(^) to be binomial random variables with parameters {U{X),M — q} and {U{Y),M — r} 
respectively. The covariance between powers of density estimates is then given by 

Cov(i^^{X)X{Y)) = ^Cov{ll{X)X{Y)) 
= E |l [P^iW = l^, W = ly) - PriUX) = QPr{lu{Y) = /,)] + 0{e-^'') 

X 

f^{X)f^{Y)llllU^{X)U^{Y) 
^ ki+''(la, X . . . X - g + l){ly X ...xly-r + 1) ^ 

[{Mx...xM-{q + r~ l))Pr(L(X) = = /,) 

-(M X ...xM -q + 1)(M X . . . X M - r + l)Pr{l^{X) = k)Pr{lu{Y) = ly)] 
+ o{l/M) 

n^)ny)^o(^\\x 



5^[(M x...xM-{q + r- l))Pr{Ux) = k, 1^{Y) = ly) 

X 

-{M X ... X M - - 1))(M X ... X M - (r - l))Pr(i^{X) = QPri^Y) = ly)] 
+ o{l/M) 
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[{M X . . . X M - {q + r - 1)) - {M X . . . X M - {q - 1)){M X . . . X M - {r - 1))] 

o(l/M) 

-qrP{X)r{Y) 1 



M \M 

Then, the covariance between the powers of the error function is given by 

Coviel{X),e:{Y)) = Co^((f.(X) - E[UXW, (f„(r) - E[UYW) 



q r 



a=l 6=1 ^ / ^ ^ 

EE (') (j)l(-/(^))'"(-/OT)' + <'(i)lC<'«ffi(-^).f^(i')) 



/r\ (-l)'^a(-l)^6 / 1 



h M \M 



( -f{X)f{Y) \ 

= l{g=l,r=l} I I +0[l/M) 

= o(l/M). 



where the last step follows from the condition that q-\-r > 2. 

□ 

Intersecting balls For {X, Y} e we have the following bounds 

Lemma A. 6. Let 71 (^), 72(^) be arbitrary continuous functions. Let Xi,..,Xm,X,Y 
denote M + 2 i.i.d realizations of the density f . Also let the indicator function 1a„(^, Y) 
denote the event A„ : {X, Y} e For q,r positive integers satisfying q-\-r > 1, 



E[lA„(X,Y)7i(X)72(Y)e^(X)e:,(Y)] = a 



1 

M , 

(55) 
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Proof. For 1a^{X,Y) ^ 0, we have {X,Y} G ^l- Then, 

E[U„(X,Y)7i(X)72(Y)e^jX)e^.(Y)] 
= E[UjX,Y)7i(X)72(Y)Ex,Y[e^u(^)eu(>^)]] 

< E 



E 



lA„(X,Y)7i(X)72(Y)VEx[e^^(X)]EY[e2'^(r)] 

1 



1aJX,Y)7i(X)72(Y)0 
1 



o 

o 
1 

M 



1 



J.q+r/2 

(7i(x)72(x) + o(l)) 



(7i(x)72(x) + o(l)) 



Au{x,y)dy ]dx 
2'^— \dx 



where the bound is obtained using the Cauchy-Schwarz inequahty and using EqjST] □ 

We can succinctly state the resuhs derived in the last two lemmas in the form of the following 
lemma: 

Lemma A. 7. Let 71 (X), 72(-^) be arbitrary continuous functions. Let Xi, .., Xjv/, X, Y 
denote M + 2 i.i.d realizations of the density f . If q,r are positive integers satisfying q + r > 2 



Cot;[7i(X)e^jX),72(Y)e;(Y)] = o(l/M). 



Proof. The result for the case g = 1, r = 1 was established earlier in Lemma A.4 

Cot;[7i(X)eUX),72(Y)e;,(Y)] = I + D, 

where '/' stands for the contribution form the intersecting balls and 'D' for the contribution 
from the dis-joint balls. / and D are given by 

I = E[lA„(X,Y)Cot;[7i(X)e^(X),72(r)e:,(r)]], 
D = E[(l-UjX,Y))CoW7i(X)e^(X),72(l^)e;,(F)]]. 

We have already established in the previous lemma that 

1 



Now, 



M 



D = E[(l-U.(X,Y))7i(X)72(Y)Ex,Y[C7ot;(e^jX),e;,(F))]] 
= E[(l - UJX, Y))7i(X)72(Y)o(l/M)] 
1 

M 



(56) 
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This concludes the proof. 



□ 



B /c-NN density estimation 

In this appendix, moment properties of the standard fc-NN density estimate ffc(X) are derived 
conditioned on Xi, . . . , Xjy. As the samples Xi, . . . , X^v, X^r+i, . . . , X^, T = M + N are i.i.d., 
these conditional moments are independent of the N samples Xi, ..jXtv- 



B.l Preliminaries 

Let (i(X, Y) denote the Euclidean distance between points X and Y and d^^ denote the 
Euclidean distance between a point X and its k-th. nearest neighbor amongst Xtv+i, .., Xat+m- 
Let Cd denote the unit ball volume in d dimensions. The /c-NN region is 

Su{X) = {Y ■.d{X,Y)<df} 

and the volume of the fc-NN region is 



Vfc(X) = / dZ. 

JSk{X) 

The standard fc-NN density estimator [.3_QJ is defined as 



MVfc(X)- 
Define the coverage function as 



P(X) = / f{Z)dZ. 

Define spherical regions 

SriX) = {Y eR'^: d{X, Y) < r}. 



B.2 Concentration inequality for coverage probability 

It has been previously established that P(X) has a beta distribution with parameters k, 
M — k + 1. [31]. Consider a binomial random variable with parameters M and P with 
distribution function Bi{.\M, P) and a beta random variable with parameters k and M — k + 1 
with distribution function Be{.\k, M — k + 1). We have the following identity, 

Be{P\k, M-k + 1) = 1- Bi{k - 1|M, P). (57) 
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The following Chernoff bounds for binomial random variables have also been established 
previously. When k < MP, Bi{k\M,P) < exp[-{MP - kf /2PM], and when k > MP, 
1 - Bi{k\M, P) < exp[-{MP - kf/2PM]. We therefore have that for some < p < 1/2, 

Pr((l -p){k- 1)/M < P(X) < (j9 + l)(fc - 1)/M) = 0(e-P''^/2). (58) 

Define 

kM = {k- 1)/M. 

Let tl(X) denote the event 

P(X) < {pk + l)kM, (59) 
where pk = VQ/{k^'^). Then, 1 - Pr(t](X)) = 0(e-P^'/2) = 0(e-3'=^'""). Equivalently, 

l-Pr(^(X)) = 0(e(A;)), (60) 

where C(/c) is a function which satisfies the rate of decay condition C(fc) = 0(e~^^'^ '''). 
Similarly, let t]_i(X) denote the event 

V{X)>{l-pk)kM, (61) 

Then 

l-Pr(UW)=0(e(fc)), (62) 
Also let t]t](X) = t](X) n t]_i(X). Then 

l-Pr(^^(X))=0(e(A;)), (63) 

Finally, we note that T[x + a)/T{x) = + o(x"). Then for any a < k, E[P^"(X)] exists and 
is given by 



B.2.1 Interior points 

Let S' to be any arbitrary subset of S/ ^ satisfying the condition Pr(Y ^ S') = o(l) where 
Y is random variable with density /. This implies that given the event tl(X), the fc-NN 
neighborhoods Sfc(X) of points X G S' will lie completely inside the domain S. Therefore 
the density / has continuous partial derivatives of order 2z/ in the fc-NN ball neighborhood 
Sfc(X) for each X G S' (assumption (^.2)). We will now derive moments for the interior set 
of points X G S'. This excludes the set of points X close to the boundary of the support 
whose /c-NN neighborhoods Sfc(X) intersect with the boundary of the support. We will deal 
with these points in Appendix B. 
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B.2.2 Taylor series expansion of coverage probability 



Let X G S'. Given the event \\{X), the coverage function P{X) can be represented in terms 
of the volume of the fc-NN ball \k{X) by expanding the density / in a Taylor series about 
X as follows. In particular, for some fixed x G S', let 



P(^) = / fiz)dz. 

JSuix) 

Using {A.2), we can write, by a Taylor series expansion of / around x using multi-index 
notation |39] 



f{z)= ^-^^{dV){x) + o{\\z - x\r) (65) 

oil 

0<\oi\<2v 

Assuming Su{x) C S, we can then write 



= / fi.z)dz 

'Suix) 



,d+2u\ 



= fixy^u" + c^ixY^^^'I'u"^^'^ + o[u'^^^). (66) 

where Cj(x) are functionals of the derivatives of /. Now, denote v{u) = f^^-^dz to be 
the volume of Su{x). Let m*™(w) be the inverse function of v{u). Note that this inverse 
is well-defined since v{u) is monotonic in u. Since Su{x) C S, v{u) = CdU'^. This gives 
M^™(t.) = {v/cdy/"^. Define 



Piv) = / fiz)dz. 



Using (66), 



v-l 



P{y) = /(X)t; + ^Q(X)t;i+''/^ + o(t;^+2"/''). (67) 



i=l 



Now denote V{j)) = P*™(p) to be the inverse of P{.). Note that this inverse is well-defined 



since P{v) is monotonic in v. Dividing (67) by vP{v) on both sides, we get 



- = ^¥\ + y.'-^^"''' + o{v''^l'P'\v)) (6J 



V Piv) ^ Piv 

^ ' i=l 
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By repeatedly substituting the LHS of ( [68| in the RHS of (68), we can obtain (69): 



1 



V{p) 



fjX) 
P 



i=i ^ 



(69) 



From our derivation of ([69|) using (67), it is clear that hi{X) are of the form 



h,(X) 



E 



{ai}=A;AeA 



where A is a z/-tuple of positive real numbers oq, ..,ai,_i and the cardinality of A is finite. 
By assumptions (^1.1) and (^.2), this implies that the constants hi{X) are bounded. Also, we 
note that h{X) = h{X) = c{X)f~^/^{X) [I5], where c(X) := ci(X) = T'^^/^\^)tr[V\f{X))]. 
This then implies that under the event \]{X) 



Vfc(X) P(X) 



te7 



(70) 



where T = {2/d, 4/d, 6/rf.., 2z//4 and hr(X) = o(p2^/^^i(X)). Now, by (yi.2), we have 
{k/Mf"/"^ = o(l/M). This implies that 2z//rf > 1. Under the event t](X), we have P(X) < 
(pfc + l)k/M, which, in conjunction with the condition 2v/d> 1 implies that 

hr(X) = o{V^'''''-\X))=o{{klMf'''''-^) = o{llkMM). (71) 

On the other hand, under the event, \\'^{X), {p^ + l)k/M < P(X) < 1, which gives 

hr(X) = 0(1). (72) 



B.2.3 Approximation to the fc-NN density estimator 

Define the coverage density estimate to be, 

Ux) = f{x)^~^ ^ 



M v{xy 



The estimate ^c{X) is clearly not implementable. Note also that the two estimates - ^c{X) 
and ffc(X) - are identical in the case of the uniform density. 

1 -/(^),,^,H,X, (T3) 



Vfc(X) P(X) pi-2/<i(x) 
where K{X) = o{l/P^-'^/'^{X)). This gives, 

UX) = f.(A-) + (^^) + t^K(X). (74) 

whenever \\{X) is true. 
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B.2.4 Bounds on fc-NN density estimates 

Let X be a Lebesgue point of /, i.e., an X for wliicli 

Because / is an density, we know tliat almost all X G S satisfy the above property. Now, fix 
e G (0, 1) and find 6 > such that 

o<r<s Js^(^,) ay 
This in turn implies that, for P{X) < P{S), 

<V.(A-)<,,J^ (75) 



(l + £)/(.Y) - " (l-e)/(A-) 
and in turn implies 

(l-e)f,(X)< f,(X) <(l + e)f,(X). (76) 

Also, because 5 > is fixed, we note that the event P{X) < P{S) is a subset of \]{X) and 
therefore (75) holds under \]{X). 

Under the event \]'^{X), we can bound Vfc(X) from above by CdT)'^. Also, since Vfc(X) is mono- 
tone in P(X), under the event \\'^{X), we can bound Vfc(X) from below by (1 + pk){k — 1)/M(1 — e)/(X) 
and therefore by {k — 1)/M(1 — e)f{X). Written explicitly. 



< Vfc(X) < CdV" (77) 



M(l-e)/(X) 
and in turn implies 

(A; - 1)/(MqD^) < f,(X) <(l-e)/(X). (78) 

Finally, note that A;Af/P(X) is bounded above by 0(1) under the event \]{X). This implies 
that for any a < k, 

Er{X)]kl,p-%X) < 0{l)PrmX)) = 0{e{k)). (79) 
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B.3 Bias of the /c-NN density estimates 



Let X G S'. We can analyze the bias of /c-NN density estimates as follows by using (74) 

h{X) 1 „ L k-1 



E[l^(x)f.(X)] = E[l^(x)fc(X)] + E 
= E[l^(x)fc(X)] + E 

= E[f,(X)]+E 



k - 1 

M J pi-2/'i(X)_ 
k-l\ h{X) 



E 



4W 



M y pi-2/d(x)_ 



o E 



M 



hs(X) 



M 



k-1 



h{X) 



M J pi-2/rf(X) 



k \ 



2/d 



M 



+oim) 



k \ 



2/d 



k \ 



2/d 



where we used the fact that under the event t]'^(X), {{k — 1)/M)P^ *(X) = 0(1) for any 
t >= 0, which in turn gives E[l^c(x)((fc - 1)/M)pi-*(X)] = 0(Pr(t]^(X))) = 0{e{k)). This 
implies that 

E[ffc(X)] - /(X) = E[l^(x)ffc(X)] + E[l^c(^)f,(X)] - /(X) 

\ 2/"! / ^ \ 2/a! 



+o(-] +o(e(.)), 



'1) 



>= 3 by (64) 



where the last step follows because , by (78), l^c(x)ffc(X) = 0(1). This expression is true for 



derivation of (81 ). 



Next, assuming that (pj) holds, we evaluate E[5f(ffc(X), X)] in an identical fashion to the 



E[l^(x)^7(ffc(X),X)] = E [mX)g (^f,(X) + A;M/i(X)(P(X))2/'^-i + A;Afh,(X),X^ 
= E [l^^x)9 (fc(X) + kMh{X){P{X))'/'-' + A;mo((P(X))2/'^-1),X 
= E [g (f,(X) + kMHXKPiX))'/"-' + fcMo((P(X))2/^-i),x)] + 0(e(fc)) 
= E[^?(f,(X),X)+^?'(f,(X),X)fcM/^(X)(P(X))2/'^-i + o(A;MP(X))2/'^-i)] + 0(e(A;)) 
= g{f{X),X)g,{k, M) + (^^(A:, M) + (?'(/(X), X)/i(X)(A;/M)2/^ + o((fc/M)^/'^) + 0(e(fc)). 

This gives, 

E[(7(f,(X), X)] = E[l^(x)^7(ffc(X), X)] + E[l^c(^)^7(f;^.(X), X)] 

= g{f{X),X)g,{k, M) + (72(A:, M) + g'{f{X),X)h{X){k/Mf/' + ^((fc/M)^/'^) + 0{Qm) 
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B.4 Moments of error function 



Let 7i(X), 72 (X) be arbitrary continuous functions satisfying the condition: supj^ [7j(X)] is 
finite, i = 1,2. Also let 7(X) = 71 (X). Let Xi, .., Xm, X, Y denote M + 2 i.i.d realizations 
of the density /. Let q, r be arbitrary positive integers less than k. Define the error function 

ek{X) = fk{X)-E[h{X) \X]. 

Then, 

Lemma B.l. 

E[l|X6S'}7(X)e^(X)] = Oik-"'/') + o(l/M) + 0{e{k)). (83) 

Lemma B.2. 

C^o4l|xeS'}7i(X)e^(X),l|YeS'}72(Y)e^,(Y)] = ( ^((^^,),Vi)m ) + ^^^'^'/^^ 

+ 0{l/M^) + 0{e{k)). (84) 



Define the operator M(Z) = Z — E[Z]. Let (3 be any positive real number and define 

E,(X) = 4(M(P-^(X))). (85) 



Define the terms 



e,(X) = f,(X)-E[fe(X) |X], (86) 
kAihtiX) , ^g^^ 



\ tea- 



Note that 
and 



^-*(X))_ 

e,(X) = M(A;Mhr(X)). (88) 

ee(X) = /(X)Ei(X) (89) 

e,(X) = (5^A:*AW(Ei-t(X))). (90) 



Define the event {X e S'} n {^(X)} by t(X). Note that under the event t(X), efc(X) = 
ec(X) + ei(X) + e^(X) =: eo(X). Also, under the event t](X), P(X) < (1 +Pk)kM, which 
implies that under the event t](X), the following hold 

E^{X) = 0(1), ee(X) = 0(1), ei(X) = 0(1), e,(X) = 0(1), e^iX) = 0(1). (91) 



Furthermore, by (78), under the event t](X), 

efc(X) = 0(1). (92) 



59 



Proof, of Lemma B.l Since P{X) is a beta random variable, the probability density function 

Ml 



of P{X) is given by 



{k-l)\{M-k)\ 



,M-k 



By (64), E[P-^(X)] = e((A;/M)-^) ii jS < k. We will first show that E[E^(X)] = 0(1) if 
g/3 <k. This in turn implies that, by (g) and (Q, E[e9(X)] = 0(1) and E[ef(X)] = 0(1) 
for any q < k. 

E[E^(X)] = E\kfj{p-^{X)-E[p-'^{X)]y 

Q 

i 



q 



kfj J2 ) (-l)'^"'E[p-*'^(X)]E[p-(«-*)^(X)] 



E 

i=l 



i=l 

i 



M E ! {-ir'Q{{k/Mr^)e{{k/My 



-iy-'Q{i) = 0(1). 



(93) 



By (|63j) and (|93j). 
By the definition of \\\\{X), 
and therefore 
This gives. 



E[l^^.(x)E^(X)] = Oim)- 
l^«x)E^(X) = O , 

E[E^(X)] = 0(A;-^'^/') + 0(e(A;)). 
From this analysis on E^(X), it trivially follows from (89) that 

E[e^(X)] =0{k~^'/^) + 0{e{k)). 



(94) 



Also observe that by ( 71 ) and ( 72 ) , 



E[et(X)] = E[l^(x)eUX)] + E[l^.(;,)et(X)] = o(1/M') + 0(e(fc)). 



(95) 
(96) 
(97) 



We will now bound eJ(X). Let L = Ylt&^t^- ^o"^; using (90), e[{X) can be expressed as a 
sum of terms of the form (fc/M)^(^^ ' Y[^^j{h[{X)'E\ {X)) where ^^Zt = Z. Now, we can 
bound each of these summands using (94) as follows: 



{k/M)%\{ Ej' (X)] = (A;/M)%[l^^(x) \{ E^* (X)] + {k/M)^E[h^.^x) \{ E^* (X)] 

= {k/MfWO{k-''^/^) + 0{Q{k)) 



(A;/M)^0(A;'''^/') + 0(e(A;)) 



(9^ 
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This implies that 



(99) 



Note that e^(X) will contain terms of the form {e^X) + et{X)y {er{X)y^K If Z < g, the 
expectation of this term can be bounded as follows 

|E[(e,(X) + e,(X))'(e,(X))''-']| 



< y'E[(ee(X) + ei(X))2']E[(e,(X))%-0] 

= ^0(l)2'(o(l/M))%-') 

= 0(1) X (o(l/M))^-' = o(l/M). 



(100) 



Let us concentrate on the case I = q. In this case, e^(X) will contain terms of the form 
(ec(X))'"(et(X))''-™. For m < g, 



\nMX)r{e,{X)Y- 



< ^E[(e,(X))2']E[(e,(X))%-0] 

= (0(A;-'"'^/') X o(/t-(^-'")^/')) + e(A;) = o{k-'i^/^) + 0{e{k)). 



This therefore implies that, by (|96]), (l97|), ([gg]), ( |100| ) and ( |10l| ), 

E[e^(X)] = E[e^(X)] + 0(^-^-^/2) _^ g(^) 

= Oik-"^/^) + o(A;-^^/2) ^ o(l/M) + eik) 
= 0{k~'^^/^) + oil/M) + e{k). 



This finally implies that 

E[l|xes'}7(X)e«,(X 



E[lt(x)7(X)e^,(X)] + 0{m) (%(B) 
E[lt(x)7(X)e^(X)]+0(e(A:)) 



E[l|xeS'}7(X)e^(X)] + 0{e{k)) (by^) 
0{k-'i^'^) + o(l/M) + 0(e(A;)). 



This concludes the proof. 



(101) 



(102) 



(103) 



□ 



Before proving Lemma B.2 we seek to answer the following question: for which set of pair 
of points {X, y} are the fc-NN balls disjoint? 
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B.4.1 Intersecting and disjoint balls 



Define := {X,Y} G S' : ||X - F|| > R,{X) + R,{Y) where R,{X) and R,{Y) are the ball 
radii of the spherical regions Su{X) and S'„(F), such that f{z)dz = f{z)dz = 

(1 +Pk)kM- We will now show that for {X, F} G the fc-NN balls will be disjoint with 
exponentially high probability. Let d^^ and dY^ denote the /c-NN distances from X and Y 
and let T denote the event that the fc-NN balls intersect. For {X, Y} G 

Pr(T) = Pr(d^^ + df^ > ||X-F||) 

< Pr(d^) + > P,(X) + p,(y)). 

< Pr(dJ^ > P,(X)) + Pr{dP > R,{Y)) 
= Pr{P{X)>{pk + mk-l)/M)) 

+Pr(P(r)>(pfc + l)((A;-l)/M)) 
= 2Gik), 



where the last inequality follows from the concentration inequality (58). We conclude that 
for {X, Y} G the probability of intersection of /c-NN balls centered at X and Y decays 
exponentially in plk. Stated in a different way, we have shown that for a given pair of points 
{X,Y}, if the e balls around these points are disjoint, then the fc-NN balls will be disjoint 
with exponentially high probability. Let A^{X,Y) denote the event {X, F } G \1'^. From the 
definition of the region we have Pr({X, Y} G "^f) = 0{k/M). 

Let {X, y } G \l'e and let q,r be non-negative integers satisfying q + r > 1. The event that 
the /c-NN balls intersect is given by T := {df ^ + dj^ > ||X - The joint probability 

distribution of P(X) and P{Y) when the fc-NN balls do not intersect =: T*^ is given by 

(PxPy)'''^ (I-Px- PyY'^'^^ 



h'^ipx^Pv) = M\- 



(/t-l)!2 {M-2k)\ 



Define 



and note that 



<Px,Py) = p^^^^ ^^ Px Py (1 -Px -PyT \ 
I I l{px+PY<iAPx,PY)dpxdpY = 1. 

Jpx=0 JpY=0 

Figure [22] shows the distribution of the M samples when the A;-NN balls are disjoint. Now 
note that i{px,PY) corresponds to the density function ft^{j)x,PY) for the choices t = k, 
u = k and v = M — 2k + 1. Furthermore, for {X, F } G \l/e, the set Q := {px,Py} '■ 
Px,Py < (1 + Pk){k — 1)/M is a subset of the region 7 := {px,Py} '■ < Px,Py < 1; 
Px + Py < 1- Note that E[1q] = 1 — G{k). This implies that expectations over the region 
^ •= {Px,Py} '■ < Px,Py < 1; should be of the same order as the expectations over T with 
differences of order Q{k). In particular, for t,u < k, 

E[P-*(X)P-"(F)] = E[iTP-*(x)p-"(r)] + e{k). 



62 




Figure 22: Distribution of samples when fc-NN balls are disjoint. 



From the joint distribution representation, it follows that 

E[1tP-*(X)P-''(F)] _T{M -t)T{M -u) 
E[P-*(X)]E[P-«(F)] ~ T{M -t-u)T{M) 

Now observe that 

(A;M)*+"Cot;(p-*(X),p-"(r)) 



tu 
M 



0{l/M^). 



(104) 



M 



\t+ur 



E[p-*(x)p-"(r)] - E[p-*(x)]E[p-"(r)]] 



(A;m)^"E[P-*(X)]E[P-"(F)] 



E[P-*(X)P-"(F)] 
_E[P-*(X)]E[P-"(F)] 



tu 



l-^ + o(l/M^)-l 



(by (Q and (|T04|)) 



M 



+ 0{1/M'). 



(105) 



Then, the covariance between the powers of the error function E^, for qt,ru < k is given by 
Cov{Et{X), KiY)) = k'i,'^'''-^Cov{[p-\X) - E [P-\X)]Y , [p-^{Y) - E [p-^{Y)]Y) 

= |j |j C) (0 t^"^^"^' + o(l)]4r"'^Cot;(p-*'^(X), P-«^(F)) 



a=l 6=1 



a/ \6 



M 



1 



{g=l,r=l} 



—tu 
~M 



0{l/M^ 



(106) 



Proof, of Lemma B.2 Let Xi, .., Xm, X, Y denote M + 2 i.i.d realizations of the density /. 
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Then, identical to the derivation of (103) in the proof of Lemma B.l 



C^o4l|xes'}7i(X)e^(X),l|Yes'}72(Y)eUY) 
= Co4l|xes'}7i(X)e^„(X),l|Yes'}72(Y)e^(Y)] +0(e(fc)). 

Using the exact same arguments as in proof of Lemma A.l, it can be shown that the con- 
tribution of terms er(X),er(Y) to the R.H.S. of the above equation is o(l/M). Define 
tl(X, Y) := 7i(X)72(Y)Cot;|x,Y}[(ec(X) + e^iX))^, (e,(F) + et{Y)Y]. Thus, 

C7o4l|xeS'}7i(X)e^.(X),l{YeS'}72(Y)e^,(Y)] 
= E[l|x,YeS'}tl(X,Y)] + 0(e(fc)) 

= E[lA.c(x,Y)tl(X, Y)] + E[U^(x,Y)ti(X, Y)] + 0{e{k)) 

= i + n + o{t{k)). 



For {X, Y} e The covariance term Cov{x,Y}[(ec(X) +et(X))«, (eSY) + et{Y)Y] can be 
shown to be 0{k~^'^^^^^^'^) for q,r < k hj using Cauchy-Schwarz and (100), (101) as follows. 



\Cov[ie,iX) + etiX)y,ie,iY) + etiY)y]\ < VV[(e,(X) + e,(X))5]V[(e,(r) + et{Y)) 

< 



E[(e,(X) + et(X))2^]E[(e,(r) + et{Y)y-] 

(107) 



^O(A;-(29)'5/2)O(A;-(20<5/2) 



This implies that 

II = E[lA,(x,Y)tl(X, Y)] = E[lA,(x,Y)0(fc-('^+^)^/2 



O 



^((g+r)5/2-l)]Vf^ 

where the last but one step follows since the probability Pr({X, Y} G \E'^) = 0{k/M). 



For {X, Y} e Now note that (ec(X)+ei(X))5 will contain terms of the form (ec(X))™(et(X))''-™. 
For m<q, the term (ec(X))'"(et(X) )'?-'" will be a sum of terms of the form (A;/M)(™+")p-("+^)(X) 
for arbitrary f < g— mwithu— f >=2/d. By (105), the covariance term Cof [(ec(X))'"(ei(X))^~'", (ec(y))' 
will be therefore be 0{k^Jf' / M) if either m < q or n < r. 

On the other hand, if m = g and n = r, Cov[{e^{X))'i, {e^(Y)Y] = l{g=i^r=i}0{l/M) + 



0(1/M2) by noting that the error ec(X) = /(X)Ei(X) and subsequently invoking (106) 
Therefore 

/ = E[Uc(x,Y)tl(X,Y)] 

= e[1aj(x,y) (1{,=i,.=i}0(1/M) + 0{k]l//M) + 0{l/M^)) 
= l{,=i,,=i|0(l/M) + 0(4"/M) + 0(1/M2), 
where the last step follows from the fact that probability Pr({X, Y} G '^e) = 1 ~ 0{k/M) = 



0(1). 



□ 
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B.5 Specific cases 



We now focus on evaluating the specific cases 

E[l|xes'}7(X)e^(X) 
and 

Cot;[l|xeS'}7i(X)e,(X),l|YeS'}72(Y)efc(Y)], 

for k > 2. 



B.5.1 Evaluation of E[l|xes'}7(X)e^(X) 

P(X) has a beta distribution with parameters k, M — k + 1. Therefore for k > 2 
E[Ej(X)] = E [A;^(P-^(X) -E[P-'5(X)])' 
= A;^^E[P-'''(X)] - (E[P-^(X)])' 



k 



2/3 



fT{k-2f3)T{M + 1) /Tik- /3)T{M + 1) 

^ yr{k)r{M + 1-2(3) ~ Vr(fc)r(M + i-/3) 

0{l/k) 



where the last step follows by noting that for any a > 0, 

r{x) 



r(x + a) 



+ o(l/x)). 



From (103), 



E[l|xes'}7(X)e2(X)] =E[l|xes'}7(X)e2(X)] +0(e(fc)). 

Note that e^(X) = (ec(X)+ej(X)+er(X))^ is a sum of terms of the form (ec(X)) 
Also, 

neliX)] = f{X)E[kl{p-\X)-E[p-\X)]r] 
= f\X)klE[P-\X)] - {E[P-\X)]Y 

'r(fc-2)r(M + 1) /r(A;- i)r(M + 1)^ ^ 



(108) 



(109) 

e,(X))'(e,.(X))^ 



nx)ki^ 



1 fl 
k+'ik 



T{k)T{M + 1- 2) 



T{k)T{M) 



(110) 



Using (108), identical to the derivation of (100) and (101), it is clear that if / + m > 0, 
E[(ee(X))2-'-'^(et(X))'(e,(X))"^] = o{k-^) + o{l/M) + 0{e{k)). This implies that 



E[l|xes'}7(X)e2(X)] = E [l|xes'}7(X)e^(X)] + 0(e(fc)) 



nx) 



+ 



(111) 
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B.5.2 Evaluation of Cov [l{X6S'}7i(X)efe(X), l{Y6S'}72(Y)efc(Y)] 



We separately analyze disjoint balls and intersecting balls as follows: 



Cot;[l|xes'}7i(X)e,(X),l|YeS'}72(Y)efc(Y)] 
= E[[l|xes'}l{Y6S'}7i(X)72(Y)efc(X)e,(Y)]] 
= E[[l|xeS'}l{YeS'}7i(X)72(Y)eo(X)e„(Y)]] + 0{Q{k)) 

= E[[l|xes'}l{Yes'}7i(X)72(Y)(e,(X) + ei(X) + e,(X))(ee(Y) + et{Y) + e,(Y))]] + 0{Q{k)) 

= E[[l|xes'}l{YeS'}7i(X)72(Y)(e,(X) + ei(X))(e,(Y) + et{Y))]\ + 0{Q{k)) + o(l/M) 

= E[lA,c(x,Y)7i(X)72(Y)E{x,Y}[(ee(X) + et{X)){e,{Y) + ei(r))]] 

+E[lA.(x,Y)7i(X)72(Y)E|x,Y}[(ec(X) + e,(X))(e,(r) + e,{Y))]] 

+0{Q{k)) + o{l/M) 

= I + II + 0{Q{k)) + o{l/M). 



For {X,Y} e 

E[(e,(X))(e,(r))] = Covlie^iX)), (e,(r))] 



-fiX)f{Y) 
M 



0{l/M^) 



by noting that the error ec(X) = Ei(X)//(X) and subsequently invoking (106) in conjunc- 
tion with the condition k > 2. Similarly, using (g), Q and (|T06|), 



E[(e,(X))(e,(r))] = 0(4/Vm) + Oil/M') 

E[(e,(X))(ee(r))] = 0{k'J//M) + 0{1/M') 
E[(e,(X))(e,(F))] = 0(C/M) + 0{1/M') 

This implies that 

/ = E[lAc(x,Y)E{x,Y}[(ec(X) + et{X)){e,{Y) + et{Y))]] 



E 



LA^(X,Y) 



-/(X)/(r)(l/M) + Oik^'/M) + 0{1/M' 



= E[l|xeS'}l{YeS'}7i(X)72(Y)(/(X)/(Y))] i^-l/M + 0(A;1/Vm) + 0(1/M^) 
= -E[l|xes,|7i(X)/(X)]E[l|YeS'}72(Y)/(Y)]^ + 0(4f /M) + ©(l/M^). (112) 

where the last but one step follows from the fact that probability Pr({X, Y} G = 
l-0{k/M) =0(1). 
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For {X, Y} e First observe that by Cauchy Schwarz, and by ( p^OSj ) |E[Et(X)E„(X)] | < 
v/E[E2(X)]E[E2(X)] = 0(l/fc). This imphes that 



E[(e,(X) + et{X)){e,{Y) + e<(F))] = E[ee(X)ee(F)] + 0{k''J//k). 
In subsection |B. 7 , we will show Lemma B.5 , which states that 



E[lA.(x,Y)7i(X)72(Y)ee(X)ee(Y)] 
= E[l|xes'}7i(X)72(X)f(X)] (j^ 



M 



(113) 



This implies that 



// = E[U,(x,Y)E|x,Y}[(ec(X) + etiX)){e,{Y) + e,(r))]] 
= E[lA.(x,Y)E{x,Y}[ee(X)e,(F)] + 0(4f /A;)] 
= E[lA,(x,Y)7i(X)72(Y)ee(X)ee(Y)] + E 



0i4'/k) 



A,(X,Y 

E[l{xeS'}7i(X)72(X)//2(X)] ( ^ + 0{k'^'/M) + o 



(114) 



where the last step follows from recognizing that Pr({X, Y} G \E'^) = 0{k/M) and 0{k/M) x 
1/k = 0{l/M). This implies that 



Cot;[l|XGS'}7i(X)e,(X),l|Yes'}72(Y)e,(Y 
= / + // + 0(e(A;)) + o(l/M) 



Cot;[l|xes,}7i(X)//(X),l{Yes'}72(Y)//(Y)] ( -J + o(l/M) + 0(e(A;)). (115) 



B.6 Summary 



Noting that 6 > 2/3, the equations (83), (B.2), (111), (115) imply that for positive integers 
q,r < k, 

E[l|xeS'}7(X)eKX)] = l|,=2}E[l{xeS'}7(X)f (X)] + « (^) + 0(e(A;)), (116) 



C^ot^[l{xeS'}7i(X)e^(X),l|YeS'}72(Y)eUY 

= l|,,,=i|Cot;[l|X6S'}7i(X)/(X), l|Yes'}72(Y)/(Y)] (^^ + o(l/M) 
+ l{.+.>2} (o ( ^((^^,),Vi)m ) + ^(^M /M) + 0(1/M2)) + Oim)- (117) 
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B.7 Evaluation of E[e,(X)ec(y)] for {X,Y} G 



For {X, Y} G it will be shown that the cross-correlations E[ec(X)ec(F)] of the coverage 
density estimator and an oracle uniform kernel density estimator (defined below) are identical 
up to leading terms (without explicitly evaluating the cross-correlation between the coverage 
density estimates) and then derive the correlation of the oracle density estimator to obtain 
corresponding results for the coverage estimate. 

Oracle e ball density estimate In order to estimate cross moments for the /c-NN density 
estimator, the e ball density estimator is introduced. The e-ball density estimator is a kernel 
density estimator that uses a uniform kernel with bandwidth which depends on the unknown 
density /. Let the volume of the kernel be Ve{X) and the corresponding kernel region be 
Se{X) = {Y E S : Cd\\X — Y\\'^ < Ve{X)}. The volume is chosen such that the coverage 
Qe{X) = j.^^ f{z)dz is set to (1 + pk)k/M. Let \e{X) denote the number of points among 

{Xi, ..,Xm} falling in Se{X): le(X) = T,fi-^l^^^s^(x)- The e ball density estimator is defined 
as 

Also define the error e^{X) as e^{X) = ie{X) — E,[i^{X)]. It is then possible to prove the 
following lemma using results on the volumes of intersections of hyper spheres (refer Appendix 
A for details). 

Lemma B.3. Let 7i(X), 'j2{X) be arbitrary continuous functions. Let Xi,..,Xm,X,Y 
denote M + 2 i.i.d realizations of the density f . Then, 

E[lA.(x,Y)7i(X)e,(X)72(Y)e,(Y)] 

= E[l|xeS'}7i(X)72(X)/2(X)] (^1 + o (^^^ ) . 

Next, the cross-correlations of the coverage density estimator and the e ball density estimator 
are shown to be asymptotically equal. In particular. 

Lemma B.4. 

E[ee(X)ee(F)] = E[e,(X)e,(r)] + o{l/k). 

Proof. We begin by estabhshing the conditional density and expectation of fe(X) given fc(X). 
We drop the dependence on X and denote 1^ = Tjfi^^ls^Xii^Si(x)} ^ the /c-NN coverage by P and 
the e ball coverage Q,{X) by Q. Let q = Q/P and r = (Q - P)/(l - P). The following 
expressions for conditional densities and expectations are derived in |33j 

Pr{l, = Z|P;P>Q} 

^ r (V)q'(i-q)'''-' / = o,i,...,fc-i 

\ l = k,k + l,...,M 
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Pr{l, = /|P;P<Q} 



/ = 0, 1 k 



M-k\l-ki 



/M-k\ 
\ l-k ) 



1 - r 



,M-l 



'1 ^1 

k,k + l, 



1 

M 



which imphes 



/|P;P>g] = (A:-l)g/P 

1-g 



Z|P;P < Q] 



1 



{k-M) + M 



Using the above expressions for conditional expectations, the following marginal expectation 
are obtained. Denote the density of the coverage P by fk,M{p)- Also let P be the coverage 
corresponding to the k — 2 nearest neighbor in a total field of M — 3 points. Then 



E[e,(X)e,(X)] 



E[f,(X)fe(X)] - E[fe(X)]E[f,(X)] 

1-g 



E 



Pfl 



(A; - M) + M/P lp<Q 



PiX){k-l) 



E[{{k-1)Q/P') 1p>q] 



^ kM 
P{X) (M-l)(M-2) 

k {k-2){M-k) 
E[(l - gP)(A; - M) + MP(1 - P)] 



MQ. 



MQ 



+E[((A;-l)g(l-P) 
C X {I -11 + III). 



gp)(/,_M) + MP(l-P))(lp>Q)] 



It can be shown that Cx (/-//) = ^^(l-g) using the fact that P has a beta distribution. 
Note that from the definition of g = ((1 +Pk){k — 1)/M), from the concentration inequality 
we have that E[lp^g] = C(M). The remainder (C x ///) can be simplified and bounded using 
the Cauchy-Schwarz inequality and the concentration inequality to show C x III = o{l/M). 



Therefore, 



E[e,(X)e,(X)] 



k 



g) + e(M). 



PjX) f\X) ( 1 
k M \M 



(119) 



Now denote E(X) = (ec(X) - e,(X)). Note that E[E2(X)] = E[ee(X)2] - 2E[ec(X)e,(X)] + 
E[e,(X)2]. Since E[e,{X)^] = f\X)l + o{l/k) and S[e,(X)2] = /2(X)(1/A; + o(l/A;)) it 



follows from (119) that E[i?(X)] = o(l/k). This result means ec(X) and ee(X) are almost 
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perfectly correlated. Next express the covariance between the coverage density estimates in 
terms of the covariance between the e ball estimates as follows: 



E[e,(X)e,(F)] 

= E[(e,(X) + E(X))(e,(y) + E(y))] 
= E[e,(X)e,(F)] +E[e,(X)(E(r))] 
+E[e,(F)(E(X))]+E[(E(X))(E(r))] 
= I + II + III + IV. 



Using Cauchy-Schwarz, a bound on each of the terms //, /// and IV is obtained in terms of 
E[E(X)]: |//| < ^E[E(r)]E[e,2(X)], \III\ < ^E[E{X)]E[e,-^{Y)] and \IV\ < ^E[E{X)]E[E{Y)]. 
Note that the above application of Cauchy-Schwarz decouples the problem of joint expecta- 
tion of density estimates located at two different points Xand F to a problem of estimating 
the error E between two different density estimates at the same point (s). Therefore all the 
three terms //, III and IV are o{l/k). This concludes the proof of Lemma B.4 □ 



For Lemma B.4 to be useful, E[e£(X)ee(y)] must be orders of magnitude larger than the error 
0(1 /A;), which is indeed the case for {X,Y} e vp/ since E[e^(X)e,(F)] = 0{l/k) (Lemma A. 2, 
Appendix .1) for such X and Y. This lemma can be used along with previously established 



results on co- variance of e-ball density estimates (Lemma B.3) to obtain the following result: 



Lemma B.5. Let 71 (X), 72(X) be arbitrary continuous functions. Let Xi, .., X^/, X, Y 
denote M + 2 i.i.d realizations of the density f . Then, 

E:[lA.(x,Y)7i(X)72(Y)ee(X)ee(Y)] 



Proof. 



E[l|xes'}7i(X)72(X)f (X)] ( ^ + o 



lE[lA.(x,Y)7i(X)72(Y)Ex,Y[ec(X)ee(F)]] 

= E[lA.{x,Y)7i(X)72(Y)e,(X)e,(Y)] + o(l/A;) 



E[l|xeS'}7i(X)72(X)f (X)] (j^ 



+ 



1 

M 



In the second to last step, o(l/M) is obtained for the second term by recognizing that 
Pr({X, Y} G = 0{k/M) and 0{k/M) x o{l/k) = o(l/M). □ 



C Boundary correction for density estimates 

In the previous section, moment results were established for the standard /c-NN density 
estimate ffc(X) for points X in any deterministic set S' with respect to the samples Xm = 
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{Xat+i, .., Xat+a/} satisfying the condition Pr(X ^ §') = o(l) and S' C S/, where X is an 
reahzation from density /. In this section, these moment results are extended to boundary 
corrected fc-NN density estimate ik{X) for all X G S as follows. 

Specify the set S' to be S' = S/ as defined in Exclusively using the set Xn = {Xi, .., Xat}, 
a set of interior points Jj^ C Xj^ are determined such that C S' with high probability 
1 — 0(A^C(A;)). Define the set of boundary points "Bj^ = Xy( — Jj^. For points X G Jn, the 
boundary corrected /c-NN density estimate iki^) is defined to be the standard /c-NN estimate 
ffc(X), and we invoke the moment properties of the standard /c-NN density estimate ffc(X) 
derived in the previous section. For points X G Sj^, the density estimate ffc(X) is defined 
as ik{yn) for points F„ G Jn, and we invoke the moment properties of the standard /c-NN 
density estimate ffc(X) derived in the previous section. 



C.l Bias in the /c-NN density estimator near boundary 

If a probability density function has bounded support, the /c-NN balls centered at points 
close to the boundary are often truncated at the boundary. Let 



ak{X) 



be the fraction of the volume of the fc-NN ball inside the boundary of the support. Also define 
Vfc,Af(-^) to be the fc-NN ball volume in a sample of size M. For interior points X G S', 
ak{X) = 1, while for boundary points X G S — S', aA:(X) is closer to when the points are 
closer to the boundary. For boundary points we then have 

E[f,(X)] - /(X) = (1 - «,(X))/(X) + 0(1). (120) 

Therefore the bias is much higher at the boundary of the support (0(1)) as compared to its 



interior {0{{k/ MY^'^)) (81). Furthermore, the bias at the support boundary does not decay 
to as k/M -> 0. 

In the next section, we detect interior points J>f which lie in S' with high probability 0{NQ{k)). 
The results on bias, variance and cross-moments derived in the previous Appendix for points 
X G S' therefore carry over to the points J^- A density estimate at points 'Bji is then proposed 
that will reduce the bias of density estimates close to the boundary. 



C.2 Boundary point detection 



Define Vk,M{X) 



Let p(fc, M) be any positive function satisfying p(/c, M) 



Mak{X)f(X)- 

Q{{k / + / Z"^) . From the concentration inequality (58) and Taylor series expansion 
of the coverage function (70), for small values of k/M, we have 

'V,,m(X) 



I- Pr 



Vk,M{X) 



1 



<pik,M) =0{e{k)). 
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To determine Jj^ and Sj^, we first construct a i^'-NN graph on the samples X^v where K = 
\_k X {N/M)\. For any X G X>f, from the concentration inequality (58) 



1 - Pr 



Vk,7v(X) 



Vk,n{X) 



<p{K,N))=0{e{K)) = 0{m), 



'i2r 



where Q{K) = 0{G{k)) because by (^.0), K = 6{k). This implies that, with high probabihty, 
the radius of the i^-NN ball at X concentrates around {VK,NiX) / CdY^'^- By this concentration 



inequality (121 ), this choice of K guarantees that the size of the fc-NN ball in the partitioned 



sample is the same as the the size of the i^'-NN ball in the pooled sample with high probability 
1 — Q{k). By the union bound and (121), the probability that 



K,N 



{X) 



- 1 



<p{K,N) 



is satisfied by every Xj G Xn is lower bounded by 1 — 0{NQ{k)). 

Using the i^-NN graph, for each sample X G Xj^, we compute the number of points in 
X>f that have X as a /-th nearest neighbor (/-NN), / = {1,. . . Denote this count as 

count(lC). Let Y be the Z-nearest neighbor of X, Z = {1, . . . , K}. Then Y can be represented 
asY = X + Rk{X)u where u is an arbitrary vector with < 1. 

For X to be one of the /T-NN of Y it is necessary that Rk(Y) > ||y — X|| or equivalently, 
Rk(Y)/ Rk{X) > \\u\\. Using the concentration inequality (121) for Rk{X) and Rk{Y), a 
sufficient condition for this is 



aK{X)f{X) 
aK{Y)f{Y) 



2p{K,N))> 



\u\ 



(122) 



Because / is differentiable and has a finite support, / is Lipschitz continuous. Denote the 
Lipschitz constant by L. Then, we have \f(Y) - f{X)\ < l.{K / CdNeaf'^ . Define q{K,N) = 



{h/eo){K/cdNeoy/'^ + 2V6/k^/\ Then (122) is satisfied if 



(l-g(ir,iV))> 



\u\ 



For points X G S', axiX) = 1 with probability 1 — G{k). This implies that X will be one 
of the i\:-NN of y if < 1 - q{K,N). This implies that, with probability 1 - 0{NQ{k)), 
countiX) > K{l-q{K, N)) whenever X G S'. On the other hand, for X G S - S', axiX) < 1 
with probability 1 — G{k). It is also clear that for small values of K/N, axiX) < axiY) for 
at least K/2 /-NN Y of X. This then implies that count{X.) < K{l-q{K, N)) for X G S - S' 
with probability 1 — 0{NC{k)). We therefore can apply the threshold K{1 — q{K,N)) to 
detect interior points J>f = Xj^ fl S' and boundary points S^r = Xj^ — Jj^ = Xj^ fl (S — S') 
with high probability 1 — 0{NC{k)). Algorithm 1, shown below, codifies this into a precise 
procedure. 
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Algorithm 1 Detect boundary points 

1. Construct i^-NN tree on X>f 

2. Compute count(X.) for eacli X G Xj^ 

3. Detect boundary points Sj^: 
for each X G X>f do 

if count{X.) < (1 - N))K then 

Sjv ^ X 
else 

Jtv ^X 
end if 
end for 



C.3 Boundary corrected density estimator 

Here the boundary corrected fc-NN density estimator is defined and its asymptotic rates are 
computed. The proposed density estimator corrects the fc-NN ball volumes for points that 
are close to the boundary. To estimate the density at a boundary point X G "Bj^, we find 
a point Y G Jjvf that is close to X. Because of the proximity of X and Y, /(X) ^ /(Y). 
We can then estimate the density at Y instead and use this as an estimate of /(Y). This 
informal argument is made more precise in what follows. 

Consider the corrected density estimator f^ defined in (jsj). This estimator has bias of order 
0((fc/M)i/^), which can be shown as follows. Let X denote Xj for some fixed i G {1, ..,iV}. 
Also, let X_i = argmin^gs' d{x,X.). 



Given X^r, if X G Jn, then by (81) 



E[f,(X)] = E[f,(X)] = fix) + 0((fc/M)2/^) + 0(e(A;)). 

Next consider the alternative case X G B^y. Let X„ G Jn be the closest interior point to 
X. Define h = X — X„. h can be rewritten a.s h = hi + h2, where hi = X — X_i and 
h2 = X__i - X„. Since X e "Bn implies that X G S - S' with probability 1 - 0{Ne{k)), 
consequently \\hi\\ = ||X -X_i|| = 0((A;/M)i/°') with probability 1 - 0{NG{k)). 

Again with probability 1 — 0{NG{k)), X„ G S'^ Let Qn = Uyes/argmin^g3j^(i(x, F). By 



construction of Cat, X„ G Cjv. Consequently, by ( |l2l| ), | |/i2| | = | |X_i -X„| | = 0{{1/Ny/'^) 
o{{k/Mf'^). 

Because \\hi\\ = ||X - X_i|| = 0{{k/Mf/'^) and ||/i2|| = ||X_i - X„|| = o{{k/Mf/'^) with 
probability l-0(Xe(A;)), consequently with probability l-0(Xe(A;)), \\h\\ = O^^k/Mf/'^). 
Now, 

/(X) = /(X„) + 0(11/^11). 



If Xn is located in the interior S', by (81) 



E[f,(X„)] = /(X„) + 0((A;/M)2/'^) + 0(e(A;)), (123) 
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and therefore 



E[f,(X)] = E[A(X„)]+0(iVe(fc)) 

= + 0{{k/Mf''') + 0{NQ{k)) 

= f{X) + Oi\\h\\) + 0{{k/Mf''') + 0{Ne{k)) 

= f{X) + 0{{k/My/'^) + 0{Ne{k)), (124) 

where the 0{NQ{k)) accounts for error in the case of the event that X„(j) ^ S'. This imphes 
that the corrected density estimate has lower bias as compared to the standard /c-NN density 
estimate (compare to (81 ) and (120)). In particular, boundary compensation has reduced the 
bias of the estimator at points near the boundary from 0(1) to 0{{k/My/^) + 0{Ne{k)). 



C.4 Properties of boundary corrected density estimator 



By section C.2[ Jj^ E §' with probabihty 1 — NQ(k). The results on bias, variance and cross- 
moments of the standard A;-NN density estimator derived in the previous Appendix for 
points X G S' therefore carry over to the corrected density estimator for points Jj^ with 
error of order 0{NG{k)). 

In the definition of the corrected estimator in (js), ffc(X„(j)) is the standard fc-NN density 
estimates and X„(j) G S' . It therefore follows that the variance and other central and cross 
moments of the corrected density estimator will continue to decay at the same rate as the 



standard /c-NN density estimator in the interior, as given by (116) and (117). 



Given these identical rates and that the probability of a point being in the boundary region 
S — S' is 0{{k/MY^^) = o(l), the contribution of the boundary region to the overall variance 
and other cross moments of the boundary corrected density estimator are asymptotically 
negligible compared to the contribution from the interior. As a result we can now generalize 
the results from Appendix A on the central moments and cross moments to include the 
boundary regions as follows. Denote ffc(X) — Ex[ffc(X) | X] by e(X). 

C.4.1 Central and cross moments 



For positive integers q,r < k 



E[7(X)e''(X)] = l|,=2}E[7(X)f (X)] {l)+o(l] +0(Xe(fc)), 



(125) 



Cot;[7i(X)e^(X),72(Y)e^(Y)] 



= l|,,,=i}Cot;[l|xeS'}7i(X)/(X), l|YeS'}72(Y)/(Y)] + o(l/M) 

+ l{,+.>2} (o + 0{ej,'/M) + 0(1/M^)) + 0{Ne{k))- 

Next, we derive the following result on the bias of boundary corrected estimators. 



(126) 
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C.4.2 Bias 



For k>2, 

E[7(E[ffe(X) I X]) - 7(/(X)))] = E [e f(7(f.(X)) - 7(/(X))) | X;v' 



E 



E 



l|xea.|(7(E[f,(X)])-7(/(X)))|X;v 



+ E 



E 



/ + //. 



l|xei3.|(7(E[f,(X)])-7(/(X))) 

(127) 



From (|8l|), and Pr(X G S^) = 0{{k/Mfl'^), we have 

/ = E [7'(/(X))MX)] j +o[-^) + 0{Ne{k)). 



(128) 



Next, we will now derive //. 



II 



E 



E 
E 



l|xe2,l(7(E[f,(X)])-7(/(X))) |X;v 



E 



Uxe^^MfiXn)) - 7(/ W)) + O i-j I 



+ 0(iVe(A;))(129) 



where the last step follows by (123). Let us concentrate on the inner expectation now. By 
section C.2, we know that with probability 1 — 0{NQ{k)), if X G 23 ^r, then X G S — S' and if 
X„ G Jtv, thenX„, G S'. Furthermore, ||X-X_i|| = 0(A;/M)i/'^ and | |X_i-X„| | = o{k/Mf/'^ 
with probability 1 — 0{NQ{k)). This implies that 



E 



l{xei3,}(7(/(^n))-7(/W)) + 



M ) 



2/d 



X 



= E [l|xes-S'}(7(/(^-i)) - 7(/(^))) I X^] + 
Since Pr(X G S - S') = 0{{k/Mf/'^), this in turn implies that 

"l|xea^,}(7(E[f,(X)])-7(/(X))) |X;v 



N 



M 



i/d 



+ 0{NQ{k)). 



II 



E 



E 



E[l|xes-S'}(7(/(X-i)) - 7(/(X)))] + o 



k_^ 

M ) 



2/d 



0{NQ{k)). (130) 



We therefore finally get, 

E[7(E[ffc(X) I X]) - 7(/(X)))] = / + // 



E[7'(/(X))MX)] 



2/d 



E[l{xes-s'}(7(/(X-i)) - 7(/(X)))] + o[-\ + 0(X«) 



k \ 



2/d 



Note that ||X- X_i|| = 0{{k/Mf/'^) with probability 1 -0(Xe(A;)). This therefore implies 
that 

C3 = E[l|xes-s'}(7(/(X-i))-7(/(X)))] = 0((fc/M)i/^) xO((fc/M) V^)+0(Xe(A;)) = 0{{k/ Mf/^)+0{NQ{ 
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C.4.3 Optimality of boundary correction 



Comparing (131), (125) and (126) witli (81), (116) and (117) respectively, oracle rates of con 



vergence of bias, and central and cross moments for the boundary corrected density estimate 
are attained. The oracle rates are defined as the rates of MSE convergence attainable by the 
oracle density estimate that knows the boundary of § 



k-1 



where Vfc o(X) is the volume of the region Sfc(X) fl S. It follows that the boundary com- 
pensated BPI estimator is adaptive in the sense that it's asymptotic MSE rate of convergence 
is identical to that of a fc-NN plug-in estimator that knows the true boundary. Equivalent 
corrections exist for the uniform kernel density estimator and will be left to the reader. 



D Proof of theorems on bias and variance 

Lemma D.l. Assume that U{x,y) is any arbitrary functional which satisfies 

(z) sup \U{x,y)\ = Go < oo, 

(a) sup \U{x,y)\C{k) = Gi < oo, 

a:G((?i,g„) 

{iii)K[ sup |f/(a;/p, ?/)|] = G2 < oo. 

Let Z denote Xj for some fixed i G {1, .., A^}. Let (z be any random variable which almost 
surely lies in the range (/(Z), ffc(Z)). Then, 

E[|f/(Cz,Z)|]<oo. 

Proof. We will show that the conditional expectation E[|f/(^^, Z)| | X^] < 00. Because 
< Co < /(X) < Coo < c>o by (/l.l), it immediately follows that 

E[|f/(Cz,Z)|] =E[E[|[/(Cz,^)| I X^]] < 00. 
For fixed Xn, ZeJnotZe'Bn- These two cases are handled seperately. 



Case 1: Z E Jn In this case, ik{Z) = ik{Z). By (76) and (^.1), we know that if \]{Z) 



holds, Pi/PiZ) < ffe(Z) < PufPiZ). On the other hand, if ^%Z) holds, by (78) and {A.l 
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qi < ik{Z) < qu- This therefore imphes that if \\{Z) holds, min{eo,p«/P(2')} < C,z < 
m.ax{eoQ,Pu/'P{Z)} and if t]''(Z) holds, min{eo,gi} < Cz < max{eoo, Then, 

E[|[/(Cz,^)| I Xjv] = E[l^(z)|?7(Cz,^)| I X;v]+E[%^)|t/(Cz,^)| I X^] 

< G^ + n\z) sup |f/(x/P(Z),Z)|]+max{Go,Gi/e(A;)}(l-Pr(^(Z))) 

'j:e{pi,Vu) 

< Go + E[ sup |f/(x/P(Z),Z)|] +max{Go,Gi/e(A;)}(l-Pr(t](Z))) 
= Go + G'2 + max{Gi/e(M),Go}e(A;) 

= G'o + G'2 + max{G'i,G'oe(A;)} < oo (132) 
where the final step follows from the fact that Q{k) = o(l). 



Case 2: Z G Sjv If 2' G !Bjv, let be the nearest neighbor of Z in the set J^. Then, 

h{Z) = h{Yn) (133) 

This implies that we can now condition on the event \]{Yn), and follow the exact procedure 
as in case 1 to obtain 

E[\UiCz,Z)\\Xr,] = E[V„)|f/(Cz,^)||X^]+E[%y„)|f/(l/a,^)||X^] 

< Go + G'2 + max{G'i,G'oe(A;)} < oo (134) 

where the final step follows from the fact that Q{k) = o(l). This concludes the proof. 

□ 

Proof of Theorem 13.11 

Proof. Using the continuity of g"'{x,y), construct the following third order Taylor series of 
(7(fjfc(Z),Z) around the conditional expected value E2[ffc(Z)] = E[fjt(Z) | Z]. 

g{U^), Z) = (7(Ez[ffc(Z)], Z) + (?'(Ez[ffc(Z)], Z)e(Z) 

+ ^/(Ez[f.(Z)], Z)e2(Z) + ^g^'\Cz, Z)e%Z), 

where Cz G (Ez[ffc(Z)], ffc(Z)) is defined by the mean value theorem. This gives 
E[((7(ffc(Z),Z)-(7(Ez[f..(Z)],Z))] 



E 



V'(Ez[ffe(Z)],Z)e2(Z) 



+ E 



6 



Let A(Z) = ^g^^\Cz, Z). Direct application of Lemma D.l in conjunction with assumptions 
(^.5) , {A.6) implies that E[A^(Z)] = 0(1). By Cauchy-Schwarz and assumption (yi.4) 



applied to (|125|) for the choice g = 6, 
'1 



E 



6 



A(Z)e^(Z) 



< 



E 



36 



A2(Z) 



E [e6(Z)] = 



o{Ne{k)). 
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By observing that the density estimates {ffc(Xj)},'i 
have 

E[G^(f,)] - G{f) = E[(7(f,(Z), Z) - g{f{Z), Z)] 

'1 



are identical, we therefore 



E[(7(Ez[ffc(Z)],Z)-(7(/(Z),Z)]+E 



/(Ez[ffc(Z)],Z)e^(Z) 



oii/k) + oiNe{k)). 



By (131) and (125) for the choice g = 2, in conjunction with assumption (yi.4),this imphes 
that 



2/d 



E[G^(ffc)] - = E[g'{f{Z),Z)h{Z)]i^-j + E[l|zes-M W(Z-i), Z-i) - ^^(/(Z), Z 



+E[f (Z)/(Ez[f.(Z)], Z)/2] I - ) + 0{Nm) + o{r.+ 



k \ 



2/d> 



M 



( k \ ' 

E[^7'(/(Z), Z)h{Z)\ (^-J + E[l|zes-s,}(^7(/(Z-i), Z_i) - g{j{Z), Z 

+E[f (Z)/(/(Z), Z)/2] (0 + 0(iVe(fc)) + « (^^ + (^) 
kV'" 



M 



where the last but one step follows because, by (81) and (124), we know Ez[fA,.(Z)] = /(Z) + 
0(1). This in turn implies E[/2(Z)(7"(Ez[ffc(Z)], Z)/2] = E[/2(Y)^7"(/(Y), Y)/2]. Finally, by 
assumption (yi.5) and (yi.2), the leading constants c\ and C2 are bounded. We have also 
shown in equation (130) that C3 = Oi^ikjM)'^!'^'). This concludes the proof. 



□ 



Proof of Theorem 15.11 



Proof. Let X denote Xj for some fixed i G {1,..,A^}. Also, let X_i = argmin3,.gSj (^(3;, X). 
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Using (82), we can derive the following in an identical manner to (131): 

B(G^,Bc(ffe)) = E[GN,Bc(ik)]- J 9{f{x),x)f{x)dx 

= (E[g{h{Z),Z)]-g2{k,M))/g,ik,M)- [ g{f{x),x)f{x)dx 



E[E[igihiZ),X.) -g2{k,M))/g,{k,M) \ X^]] - J g{f{x),x)f{x)dx 

E[E[{g{U^),X) - g2{k,M))/g,{k,M) \ X^],X e J,v] 
+E[E[{g{W^),X.)-g2ik,M))/g,{k,M) \ X^],X G S^] 

- / 9{f{x),x)f{x)dx 



9i{k,M) 

+o{{k/Mf''') + 0{NQ{k))] - j gif{x),x)f{x)dx 

Because we assume the logarithmic growth condition k = 0((log(M))^/*^^~'^)), it follows that 
0{Ne{k)) = 0{N/M^) = o(l/T). Also, by 1^, gi{k,M) = 1 + o(l). This implies that 



B(G^,Bc(f.)) = +C3 + on-j j. (135) 

□ 

Proof of Theorem 13.21 and Theorem 15.21 

Proof. By the continuity of g^'^^ ix,y), we can construct the following Taylor series of 5^ (f^ (Z) , Z) 
around the conditional expected value E^[ffc(Z)]. 

9(ft(Z),Z) = 9(Ezft(Z)],Z) + 9'(Ez|fi(Z)|,Z)e(Z) 

where e (^(Ez[ffc(Z)], ^(ffc(Z))). Denote (c/^(^z, Z))/A! by ^(Z). Further define the 
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operator M(Z) = Z - E[Z] and 



= M((7(ExJffc(X,)],X0), 
q, = M(^7'(ExJffc(X,)],Xi)e(Xi)), 

. ^^^.^-)(ExJf.(x.)],xo^.^^^^ 

Vi=2 

= M(*(Xi)e"(Xi)) 

The variance of the estimator G7v(ffc) is given by 

V[G^(f,)]=E[(G(/)-E[G(/)])^] 
^;E[{pi + qi + ri + SiY] 



+ 



N 
N -1 



N 



+ gi + ri + Si){p2 + q2 + r2 + S2)]. 



Because Xi, X2 are independent, we have E[(pi)(p2 + q2 + r2 + S2)] = 0. Furthermore, 
E[(pi + gi+ri + si)'] = E[pi2] + o(l) = V[^(Ez[f(Z)],Z)] + o(l). 



From assumption (^.4) apphed to ( 125[ ) and ( |126 ), in conjunction with assumption (^.3), it 
follows that 

• E[pi2] =v[(7(Ez[fA.(Z)],Z)] 

. E[q,q2] = V[^?'(Ez[f,(Z)], Z)/(Z)] (^) + o {jj) + 0{Ne{k)) 



1 



)+o( ^i^Ii^ ) + o(ive(fc)) = o (^) + omik)) 



• nrir2] = EaEaO( .»n..)Wi)M ) + O (^^(^1^) + 0{Nm) 

o{Ne{k)) 

Since gi and S2 are mean random variables 

E[q,S2] = E [gi^(X2)(f(X2) - ExJffc(X2)])" 

= E L^(X2)(f(X2) - ExJffc(X2)])" 



(i^) + 



< ^E[M/2(X2)]E [gf(f(X2) -Ex,[f.(X2)])2A 
= VEMz)l(o(^)+0(iVe(A:)) 
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Direct application of Lemma D.l|in conjunction with assumptions (^.5), (^.6) implies that 



E [\I/^(Z)] = 0(1). Note that from assumption (yi.3), o (p^) = o(l/M) . In a similar manner, 
it can be shown that E[riS2] = o {^) +0{Ne{k)) and E[siS2] = o {j^) + 0{Ne{k)). Finally, 
by ^ and (|l24|, we know Ez[ffc(Z)] = E[ffc(Z)] = /(Z) + o(l). This implies that 



V[G;v(ffc)] 



V[(7(Ez[ffe(Z)],Z)] 



-E[qiq2]+0{Ne{k)) + o 



1 



M N 



N 



V[(7'(Ez[ffc(Z)],Z)/(Z)] 



1 

M 



oiNe{k)) + o 



A 



Vb(/(Z), Z)] (j^ + Y[g'U\Z), Z)/(Z)] (j^ + OiNm) + « + ^ 



C4 



AT 



where the last but one step follows because, by (81) and (124), we know Ez[ffc(Z)] = /(Z) + 
o(l). This in turn implies V[^7(Ez[ffc(Z)], Z)] = V[(7(/(Z), Z)] and Y[g'(EzMZ)],Z)f{Z)] = 
Y[g'{f{Z), Z)/(Z)]. Finally, by assumptions (71.5) and (71.2), the leading constants C4 and C5 
are bounded. This concludes the proof of Theorem |3.2[ 

Under the logarithmic growth conditio n k = 0((log(M))^/(-'^~'^)), g2{k, M) = o(l) and gi{k, M) 
1 + 0(1) by assumption ([s]). Theorem 5.2 follows by observing that GN,Bc(!^k) = (GAr(ffc) — 
g,ik,M))/g2ik,M) ' □ 



Bias of Baryshnikov's estimator: Proof of equation (|5 

Proof. We will first prove that 

B(G^,(ffc)) = e((fc/M)i/'^ + l/fc), 



(136) 



Because the standard /c-NN density estimate iksO^i) is identical to the partitioned /c-NN 
density estimate ffc(Xj) defined on the partition {Xj} and {Xi, .., X^} — {Xj}, it follows that 



KG^(ffcs)) = e((A;/T)^/'^ + l/A;). 



(137) 



From the definition of set §' in section B.2.1, we can choose the set S', such that Pr{Z ^ 

s') = oiik/My/^). 

E[G^{h)] - G{f) = E[g{h{Z), Z) - g{f{Z), Z)] 

= E[l|Z6S'}^7(ffc(Z), Z) - g{f{Z), Z)] + E[l|zes-s'}^7(f;t(Z), Z) - g{f{Z), Z)] 

= 1 + 11 (138) 



Using the exact same method as in the Proof of Theorem 3.1, using (81) and (116), and the 
fact that Pr(Z ^ §') = 0{{k/My/^) = o(l), we have 



k \ 



2/d 



I = E[g'{f{Z), Z)h{Z)] ( -\ + E[/^(Z)/(/(Z), Z)/2] ^ j + 0(e(fc)) + o ^ ^ + 



1 



M 



2/d 



Because we assume that g satisfies assumption (^.6), from tlie proof of Lemma D.l, for 
Z G S - S', we have E[^(ffe(Z), Z) - ^(/(Z), Z)] = 0(1). This imphes that, 



11 = E[l|zes-s'}^7(ffe(Z),Z)-(7(/(Z),Z)] 

= E [E[g{h{Z), Z) - g{f{Z), Z)] \ l{zes-S'}] x Pr(Z ^ S') 
= 0(1) X 0{{k/Mf/'^) = 0{{k/My/'^). 

This concludes the proof. 



(139) 



□ 



E Asymptotic normality 

Define the random variables {YM,i', i = 1, ■ ■ . , N} for any fixed M 

_ g{UX,),Xi) -E[g{h{X,),X,)] 

1 M,i — 



N 



/V[^/(ffc(X,),X,)] 

and define the sum Sn^m 

* i=l 

where the indices and M explicitly stress the dependence of the sum S n,ai on the number 
of random variables + M. Observe that the random variables {Ym,^; i = 1, . . . , A^} belong 
to an mean, unit variance, interchangeable process [5] for all values of M. To establish 
the CLT for Sn,m, we will exploit the fact the random variables {YM,i',i = 1, . . . , A^} are 
interchangeable by appealing to DeFinetti's theorem, which we describe below. 



E.l De Finetti's Theorem 

Let 5" be the class of one dimensional distribution functions and for each pair of real numbers 
X and y define 3^{x, y) = {F E 3^\F{x) < y}. Let S be the Borel field of subsets of 3^ generated 
by the class of sets 3^{x,y). Then De Finetti's theorem asserts that for any interchangeable 
process {Zj} there exists a probability measure n defined on S such that 

Pr{B} = J Prp{B}dfi{F), (140) 

for any Borel measurable set defined on the sample space of the sequence {Z^}. Here Pr{B} 
is the probability of the event B and PrpiB} is the probability of the event B under the 
assumption that component random variables Xj of the interchangeable process are inde- 
pendent and identically distributed with distribution F. 
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E.2 Necessary and Sufficient conditions for CLT 



For each F e 3^ define m{F) and ct^(F) as m{F) = J^^xdF{x), cr{F) = j^^x^dF{x) — 1 
and for all real numbers m and non-negative real numbers let 5'm,o-2 be the set of F G 3" 
for which m{F) = m and (j'^{F) = a"^ . 

Let {Zj; i = 1, 2, . . .} be an interchangeable stochastic process with mean and variance 1. 
Blum etal |5] showed that the random variable Sat = Yl!i=i converges in distribution 
to A^(0, 1) if and only if /i(5'o,o) = 1- Furthermore, they show that the condition /i(?'o,o) = 1 
is equivalent to the condition that Cof(Zi,Z2) = and Cof (Z^,Z|) = 0. We will extend 
Blum etaVs results to interchangeable processes where Cow(Zi, Z2) = o(l) and Cow(Z^, Zg) = 

0(1). 

In particular, we will show that Cov{Y m,i,Y m,2) and Cof (Y|^^, Y^jg) are 0(1/M). Sub- 
sequently we will show that the random variable Sn,m = Yl!i=i^M,i converges in distri- 
bution to A^(0, 1) and conclude that Theorem 3.3 holds. 



E.3 CLT for Asymptotically Uncorrelated processes 



Let X be a random variable with density /. In the proof of Theorem |3.2[ we showed that 

Cot;((?(ffc(X,),X,),(?(ffc(X,),X,)) 



M,j) 



V[(7(ffc(X,),X,)]V[(7(ffc(X,),X,)] 
Cov{pi + qi + ri + Si, Pj + qj + rj + Sj] 



'V[^?(ffc(X,),X,)]V[(7(ffc(X,),X,)] 
Cov{pi + qi + ri + Si, Pj + qj + Vj + Sj) 



'V[(7(ffc(X,),X,)]V[(7(ffc(X,),X,)] 
V(^7'(/(X),X)/(X)) / 1 \ /I 



V[(7(/(X,),X,)] \MJ \M 
V(^'(/(X),X)/(X)) / 1 \ fl 



V[(7(/(X,),X,)] \MJ \M 



+ o{Ne{k)) 

(141) 



where the last but one step follows by observing that NC{k)/M — > under the logarithmic 
growth condition k = 0((log(M))^/*^-'^~^)). Define the function d{x,y) = q( x, y) (qix, y) — c), 
where the constant c = E[(7(fyfc(X), X)]. Then, similar to the derivation of (141), we have, 

r 1^2 v2 ^ - C'ot;(d(f,(X,),X,),rf(f,(X,),X,)) 

OOf 1^ 1 Af,j5 1 M,j) — 



V[rf(ffc(X,),X,)]V[rf(ffc(X,),X,)] 
V(rf'(/(X),X)/(X)) / 1 \ fl 



V[rf(/(X,),X,)] \M) \M 



(142) 
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Proof of Theorem 13.31 and Theorem 15.31 



Proof. Let 5^(M) and 5o-(M) be a strictly positive functions parameterized by M such that 
5^{M) = o(l); = o(l), 5.(M) = o(l); = o(l). Denote the set of F G 3^ with 

S'mAM ■■= {^'(^) > ^-M/ := {^'(^) > <^a(M)}; J;;^,,,, := {m\F) e (0,5^(M))} 

and 3^1 SM '■— W^i^) ^ (0,5o-(M))}. Denote the measures of these sets by fim,5,M, fJ'a,s,M, 



/^m s M f^ts M respectively. We have from ( 140 ) that 
/ m\F)dfi{F) = Cov{YM,i,YM,j) 

fa'{F)dfiiF) = [[EF[Z'-l]]'dfiiF)=CoviYl^„Yl^^). (143) 
Applying the Chebyshev inequality, we get 

S^{M)Hm,5,M < Cov(YM,i,YM,j), 

5AM)fi^,s,M <Cov(Yl^,,Yl,^^). 

Because the covariances decay at 0(1/M), /im.,<5,M and fia,s,M — )• as M — )■ oo. From the 
definition of 3^^,s,m •^a,s,My ^^^^ have that fim,s,M iAt,5,m — > as M — oo. We 
also have 

and therefore 

lim /i(:ro,o) = 1. (144) 



We will now show that GAr(ffc) = (Gjv(ffe) — E[Gjv(ffc)])/(\/ ^[GAf(ffc)]) converges weakly to 
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N(0, 1). Denote ^(ffc(Xi),Xi) by gi. Observe that 

lim Pr{Gjv(ffc) < «} = Hm / PrpiGNih) < a}dfi{F) 

= lim f Prp{GNih) < a}dfx{F) + lim f li^^^.j^ ^,}Pr^{Giv(ffc) < a}dfx{F) 
= Hm ^ PrpiG^^h) < a}dfi{F) + ^ Ji m (l{^65-5o,o}^^F{G;v(ffc) < «}) 



lim / PrpiG^ih) < a}dfi{F) 



lim / P,y 1 V «(fUX.).X.)-Eb(f.(X.).X.)) ^ I 



Ihn / Pr, I'J: «(f.(X.).X.)-Eb(f,(X.).X.)] ^ 1 . 



1 

M ^ 



lim y Pr^ <j - > ' I ^(f.(XO,X.)-E[^(f,(X.),X.)] ^ ^ ^ 



N 



'Y[g,]/N+{{N-l)/N)^Y[gi]Y[gj]Cov[YM,,YMj 
lim / Pr J 1 E ( " ^[^(f^(X.),X.)]\ ^ ^ I 



lim / Pr 



{a)dfi{F) = (j){a), 

where (i>(.) is the distribution function of a Gaussian random variable with mean and vari- 



ance 1. Step (D.6) follows from the Dominated Convergence theorem. By (144), limA-s>o l{FeJ-Jo o} 
almost surely. This gives Step (D.7). Step (D.8) is obtained by observing that, by (143), 
Cov[YM,i,Y'M,j] = when F G 3^o,o- The last step (D.9) follows from the CLT for sums 
of mean, unit variance, i.i.d random variables and ( |144 ). This concludes the proof of 
Theorem 13.31 



To show Theorem 



5.3 



observe that under the logarithmic growth condition k = 0((log(M))^/*^^~'^^) 
g2{k,M) = o(l) and gi{k,M) = 1 + o(l) by assumption Since GN,Bc(^k) = (GAr(ffc) — 
gi{k, M))/g2{k, M), it follows that the asymptotic distribution of 

gnM ^) -nG M,Bcm 

V[G^,Bc(ffc)] 
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is equal to the asymptotic distribution of GAr(ffc) = (G^vlffe) — ^[^N{ik)])/{y^[^N{ik)])- 

□ 



E.4 Berry-Esseen bounds 

We now establish Berry-Esseen bounds for the case where ^ — )■ 0. In particular, we assume 
that there exists a 6 : < 6 < 1, such that = 0{M^). We also assume that the 
interchangeable process has finite absolute third order moment E{\ZM,if) = Pm < oo VM. 

E.4.1 Details 

Define the subset F of F as follows: F — F — {Fm,5,M U I^(t,s,m}- 
We recognize that for F e F, we have 

-V^aiM) < a(F) < y/5,(M). 

The mean and variance of Ym.i under the distribution F are given by m{F) and (t{F) + p — 
w?{F) respectively. 

As in the previous section, let (j) be the distribution function of a Gaussian random variable 
with mean and p variance. 



Lower bound 



Pr{^N,M ^ = / PrF{^N,M < a}dp{F) 

> J PrF{SN,M < a}dp{F) 

a - VNm{F) \ Ck{F) 







1 + ((7(F) - Tm?{F))lp j {a{F) +p- w?{F)f VN 

a-^N6^{M) \ ^^^^^ f Ck{F) 
1 + 



dp{F) 



> \ ' /i(F) - / ^ ' ^dp{F) 



^ ^ - ^N6,{M) \ Ck 
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Upper bound 

Denote ^{r'^) ■= ft- We note that jl < f^m,s,M + fJ^a,s,M- 



Pr{SN,M <a}= I PrF{SN,M < a}dn{F) 
< J PrF{SN,M < a}dfx{F) + fl 
a-y/Nm{F) 







Ck{F) 



1 + (a(F) - m2(F))/p / + p - m^{F) f VN 



<<t> 
<0 



a+^N5^{M) 



< 



l-{./s;(M) + 6,{M))/p^ 

a - ^NS,{M) 
l-{y/djM) + S,{M))/p^ 



/^(r) + 



Ck{F) 



Ck 



dp{F) + jl 
dp{F) + fl 



{p+,/s;{M)r^/N 

Ck 



1 1 



We have shown that the appropriately normahzed sum Sn,m converges in distribution to a 
normal random variable. Also for the case where N grows slower than M, we have established 
Berry-Esseen type bounds on the error. 



F Uniform kernel based plug-in estimator 

In this section, we will state the main results concerning uniform kernel plug-in estimators. 
The proofs for these results rely on the properties of the uniform kernel density estimates 
established in Appendix A and proofs for equivalent results for the /c-NN plug-in estimators. 
Let fu denote the boundary corrected uniform kernel density estimate. Denote the uniform 
kernel plug-in estimator by 




Let Y denote a random variable with density function /. 
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F.l Results 

Corollary F.l. Suppose that the density f, the functional g and the density estimate 
satisfy the necessary conditions listed above. The bias of the plug-in estimator Guif) is then 
given by 



BuU) = 




where Ci = E[^'(/(Y), Y)c(Y)], C2 = E[/(/(Y), Y)/(Y)/2] are constants which depend on 
the underlying density f . 

Corollary F.2. Suppose that the density f , the functional g and the density estimate f„ 
satisfy the necessary conditions listed above. The variance of the plug-in estimator Gu{f) is 
given by 



V.(/) = 




where C4 = V[g'(/(Y), Y)] and C5 = V[/(Y)gi'(/(Y), Y)] are constants which depend on the 
underlying density f. 

Corollary F.3. Suppose that the density f, the functional g and the density estimate sat- 
isfy the necessary conditions listed above. Further suppose 'E[\g{f)\^] is finite. The asymptotic 
distribution of the plug-in estimator G„(/) is given by 

„ / G..(/)-E[G„(/)] _\ _ 
A(k,N.M)^o \^Ylf(Y)g'(f(Y),Y)]/N J 

where Z is a standard normal random variable. 
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